Introduction
Introductory Statistics with R (ISwR) was written by Peter Dalgaard. This is a fine book that is suitable for use in a course in applied multivariate statistics, such as Boston University’s CAS MA 684.
The book’s chapters are:
- Basics
- Probability and Distributions
- Descriptive statistics and graphics
- One- and two-sample tests
- Regression and correlation
- ANOVA and Kruskal-Wallis
- Tabular data
- Power and the computation of sample size
- Multiple regression
- Linear models
- Logistic regression
- Survival analysis
This page stores my notes as I work through the chapters of this book. I am using R 2.2.1 and R GUI 1.14 for Mac OS X.
The ISwR package
The ISwR package contains the data used in the examples and exercises of the book.
Installing ISwR
Here’s how to install the ISwR package using R for Mac OS X:
- In the R GUI, click on the Packages & Data menu and select Package Installer.
- In the R Package Installer window, click on the popup menu at the upper left of the window and select CRAN (binaries), then click on the Get List button.
- R may prompt you at this point to choose a CRAN mirror site.
- In the package list, select ISwR, then click on the Install Selected button at the lower right. R will download and install the ISwR package on your system. When the download is complete, the value in the Installed Version column is updated.
- The ISwR package requires the survival package. If the survival package is not already installed, install it the same way you installed ISwR.
- Close the R Package Installer window.
Loading ISwR
Here is the command for loading the ISwR package in an R session so you can use it:
library( package = ISwR )
For information about the contents of the ISwR package, enter the following command:
help( package = ISwR )
Chapter 1
§ 1.1
You can copy the code below and paste it directly into R to see the results. But you’ll get a better feel for what’s going on, and you’ll begin to internalize the R commands, by typing in the code yourself.
## Create a vector containing weights.
weight <- c( 60, 72, 57, 90, 95, 72 )
## Create a vector containing heights.
height <- c( 1.75, 1.80, 1.65, 1.90, 1.74, 1.91 )
## Create a derived value, bmi, using vectorized math.
bmi <- weight / ( height ^ 2 )
print( bmi )
## Calculate the mean of the weights and view it.
xbar <- sum( weight ) / length( weight )
print( xbar )
## Calculate the standard deviation of the weights, using the traditional
## formula, and view it.
stddev <- sqrt( sum( ( weight - xbar ) ^ 2 ) / ( length( weight ) - 1 ) )
print( stddev )
## Calculate the standard deviation of the weights using a formula that
## doesn't require pre-calculating the mean. (This formula is not presented
## in ISwR.)
variance <-
( sum( weight ^ 2 ) - sum( weight ) ^ 2 / length( weight ) ) /
( length( weight ) - 1 )
stddev <- sqrt( variance )
print( stddev )
## Calculate the mean of the weights using R's built-in function.
xbar <- mean( weight )
print( xbar )
## Calculate the standard deviation of the weights using R's built-in function.
stddev <- sd( weight )
## Carry out a t test and view the result.
bmi.t.test <- t.test( x = bmi, mu = 22.5, alternative = "two.sided" )
print( bmi.t.test )
## Plot the points.
plot( x = height, y = weight, pch = 2 )
## Calculate the coordinates of the predicted weights from the heights
## using bmi = 22.5, and plot the curve by drawing lines between the
## predicted points.
hh <- seq( from = 1.65, to = 1.90, by = 0.05 ) # heights
ww <- 22.5 * hh ^ 2 # predicted weights
lines( x = hh, y = ww )
§ 1.2
The following objects, mostly functions, are described in this section:
:operator$operatorapply()args()as.numeric()c()cbind()colnames()dim()factor()is.na()lapply()LETTERSvectorlettersvectorlevels()list()matrix()month.abbvectormonth.namevectorNANULLorder()ordered()rep()rbind()rownames()sapply()seq()sort()split()subset()t()tapply()transform()
I found indexing difficult to master, and it still confuses me at times. Here are some examples:
## Create a vector and get the 5th item from it. x <- seq( from = 3, to = 12, by = 1 ) x[ 5 ] ## 7 ## Create a matrix and get the item in the 2nd row, 3rd column. m <- as.matrix( 1:12, nrow = 3, ncol = 4, byrow = TRUE ) m m[ 2, 3 ] ## 7 ## Create a list with three components; the third component, z, is shorter. ll <- list( x = 1:12, y = 4:15, z = c( "A", "B", "C" ) ) ll ## Return a list containing the 2nd component of list ll. ll[ 2 ] data.class( ll[ 2 ] ) ## "list" ## Return the 2nd component of list ll. Since the 2nd component of the list ll ## is a vector, this returns the vector. ll[[ 2 ]] data.class( ll[[ 2 ]] ) ## "numeric" ## Return a list containing the 2nd and 3rd components of list ll. ll[ c( 2, 3 ) ] ## Return component "y" of list ll. ll$y ## Same result as ll[[ 2 ]] ## Return the 3rd item of the vector that is the 2nd component of list ll. ll[[ 2 ]][ 3 ] ## 6 ll$y[ 3 ] ## identical result ll[[ c( 2, 3 ) ]] ## identical result ## A data.frame is a list where all components have equal length. This ## makes it easy to extract elements at specific row and column positions. ## Note what happens to component 3 when ll is converted to a data.frame. d.f <- data.frame( ll ) d.f ## Extract the element at row 3, column 2. d.f[ 3, 2 ] ## 6 ## Return a data.frame containing component (column) 3 of d.f. d.f[ 3 ] data.class( d.f[ 3 ] ) ## "data.frame" ## Return a data.frame containing the elements of row 3 of d.f. d.f[ 3, ] data.class( d.f[ 3, ] ) ## "data.frame" ## Return a vector containing the elements of column 2 of d.f. d.f[ , 2 ] data.class( d.f[ , 2 ] ) ## "numeric"
Conditional selection from data.frames is easy using the subset() function.
A command that doesn't use subset() but returns the same result is
also shown.
d.f <- data.frame( x = 1:12, y = 4:15, z = LETTERS[ 1:12 ] ) ## Return a data.frame containing the rows of d.f where the values of column 2 ## are greater than 8. These two commands return identical results. subset( d.f, y > 8 ) d.f[ d.f$y > 8, ] ## Return a data.frame where the values of column 2 are greater than 8 and ## less than or equal to 10. These two commands return identical results. subset( d.f, y > 8 & y <= 10 ) d.f[ d.f$y > 8 & d.f$y <= 10, ]
Here are some simple examples of using lapply(), sapply(),
apply(), and tapply():
d.f <- data.frame( x = rnorm( n = 5 ), y = rpois( n = 5, lambda = 2 ) )
lapply( d.f, mean ) ## column-wise (variable) means, returned as a list
sapply( d.f, mean ) ## column-wise (variable) means, returned as a vector
apply( d.f, 1, min ) ## row-wise min values, returned as a vector
d.f2 <-
data.frame(
x = rnorm( 10 ),
y = rnorm( 10 ),
z = rep( c( "A", "B" ), length = 10 ) )
d.f2
tapply( d.f2$x, d.f2$z, median )
§ 1.3
I have added the code for Figure 1.5, the layout of a standard plot, to my collection of scripts.
§ 1.4
This section on programming is too advanced for this point in the book, but it contains much useful information.
§ 1.5
I have no notes for this section.
§ 1.6
I usually use Perl to format my data before loading it into R. This means I rarely
need to use anything other than read.table().
Chapter 5
§ 5.1
The example is easier to understand if a model object, thuesen.lm,
is created and used in the subsequent steps. The model object contains the data
and the results encapsulated within it; this is a powerful feature of the S
programming language.
## Load the data.
library( package = ISwR )
data( thuesen )
## Perform the analysis.
thuesen.lm <-
lm(
formula = short.velocity ~ blood.glucose,
data = thuesen )
## Show a summary of the coefficients.
options( show.signif.stars = FALSE )
summary( thuesen.lm )
## View the analysis of variance table.
anova( thuesen.lm )
## Plot the data with the regression line.
## This creates Figure 5.1 of the book.
get( getOption( "device" ) )()
plot( short.velocity ~ blood.glucose, data = thuesen )
abline( thuesen.lm )
## For didactic purposes, compute the p-value for the t test using the
## observed t, 2.101, and 21 degrees of freedom.
## pt( q = 2.101, df = 21 ) calculates P( t <= 2.101, df = 21 ).
## 1 - pt( q = 2.101, df = 21 ) calculates P( t >= 2.101, df = 21 ).
## 2 * ( 1 - pt( q = 2.101, df = 21 ) ) calculates
## P( |t| >= 2.101, df = 21 ), a two-tail test.
p.value <- 2 * ( 1 - pt( q = 2.101, df = 21 ) )
print( p.value )
## For didactic purposes, compute the p-value for the F test using the
## observed F, 4.414, with 1 numerator degree of freedom and 21 denominator
## degrees of freedom.
p.value <- 1 - pf( q = 4.414, df1 = 1, df2 = 21 )
print( p.value )
## The F value is computed by the summary() function.
## This example shows how to extract the F value and the degrees of
## freedom to compute the p-value.
thuesen.lm.summary <- summary( thuesen.lm )
Prob <-
pf(
q = thuesen.lm.summary$fstatistic[ "value" ],
df1 = thuesen.lm.summary$fstatistic[ "numdf" ],
df2 = thuesen.lm.summary$fstatistic[ "dendf" ] )
p.value <- 1 - Prob
print( p.value )
§ 5.2
There are many ways to handle missing data. I am bewildered by all the options. Here I amplify the book’s discussion of this problem.
Example 1 below handles missing data implicitly. The default behavior of lm()
is to handle missing data based on the value of getOption("na.action"), the
“factory-fresh” value of which is na.omit. If we don’t know that this option
has the desired value, we can set it using the options() function before
calling lm().
The result of handling missing data using na.omit is that the vector
returned by fitted(thuesen.lm1) is one element shorter than the input vectors,
thuesen$short.velocity and thuesen$blood.glucose.
Furthermore, the lm object’s copy of the data, here thuesen.lm1$model,
is a data.frame with no rows containing NA values. We can avoid
problems in unequal vector lengths by using the data stored within the lm object.
## Example 1: implicit na.omit
## Load the data.
library( package = ISwR )
data( thuesen )
## Perform the analysis.
options( na.action = na.omit )
thuesen.lm1 <-
lm(
formula = short.velocity ~ blood.glucose,
data = thuesen )
## Show a summary of the coefficients.
options( show.signif.stars = FALSE )
summary( thuesen.lm1 )
## Returns 23:
length( fitted( thuesen.lm1 ) )
## Returns 24:
length( thuesen$blood.glucose )
## Returns FALSE:
length( fitted( thuesen.lm1 ) ) == length( thuesen$blood.glucose )
## Returns TRUE:
length( fitted( thuesen.lm1 ) ) == length( thuesen.lm1$model$blood.glucose )
Example 2 explicitly omits missing data by explicitly setting the using the
na.action argument of the lm() function to na.omit.
The vector lengths of the original data.frame are still longer than the vector returned by
fitted().
## Example 2: explicit na.omit
## Load the data.
library( package = ISwR )
data( thuesen )
## Perform the analysis. Specify the na.action argument.
thuesen.lm2 <-
lm(
formula = short.velocity ~ blood.glucose,
data = thuesen,
na.action = na.omit )
## Show a summary of the coefficients.
options( show.signif.stars = FALSE )
summary( thuesen.lm2 )
## Returns 23:
length( fitted( thuesen.lm2 ) )
## Returns 24:
length( thuesen$blood.glucose )
## Returns FALSE:
length( fitted( thuesen.lm2 ) ) == length( thuesen$blood.glucose )
## Returns TRUE:
length( fitted( thuesen.lm2 ) ) == length( thuesen.lm2$model$blood.glucose )
Example 3 explicitly excludes missing data by setting the
na.action argument of the lm() function to
na.exclude. Rows containing NA tracked by the lm object, but
they are not preserved in the objects model data.frame.
The result is that the length of the vector returned by fitted() is
now the same length as the vectors from the original data.frame. Note that the value of
fitted( thuesen.lm3 )[ 16 ] is NA.
## Example 3: explicit na.exclude
## Load the data.
library( package = ISwR )
data( thuesen )
## Perform the analysis. Specify the na.action argument.
thuesen.lm3 <-
lm(
formula = short.velocity ~ blood.glucose,
data = thuesen,
na.action = na.exclude )
## Show a summary of the coefficients.
options( show.signif.stars = FALSE )
summary( thuesen.lm3 )
## Returns 24:
length( fitted( thuesen.lm3 ) )
## Returns 23:
length( thuesen$blood.glucose )
## Returns TRUE:
length( fitted( thuesen.lm3 ) ) == length( thuesen$blood.glucose )
## Returns FALSE:
length( fitted( thuesen.lm3 ) ) == length( thuesen.lm3$model$blood.glucose )
Example 4 handles missing data by creating a data.frame that contains no missing data.
We create two copies of the thuesen data set from which we have removed rows
containing NA values through the use of complete.cases().
Vector lengths of the input data.frame, here thuesen2, are 23, equal to the
length of the vector produced by fitted().
## Example 4: complete.cases
## Load the data.
library( package = ISwR )
data( thuesen )
## Copy the data.frame without the NA values.
## These three commands give the equivalent result.
thuesen2 <- thuesen[ complete.cases( thuesen ), ]
thuesen3 <- subset( thuesen, complete.cases( thuesen ) )
## Returns TRUE:
isTRUE( all.equal( thuesen2, thuesen3 ) )
## Perform the analysis on the data subset, here thuesen2.
thuesen2.lm <-
lm(
formula = short.velocity ~ blood.glucose,
data = thuesen2 )
## Show a summary of the coefficients.
options( show.signif.stars = FALSE )
summary( thuesen2.lm )
## Returns 23:
length( fitted( thuesen2.lm ) )
## Returns 23:
length( thuesen2$blood.glucose )
## Returns TRUE:
length( fitted( thuesen2.lm ) ) == length( thuesen2$blood.glucose )
## Returns TRUE:
length( fitted( thuesen2.lm ) ) == length( thuesen2.lm$model$blood.glucose )
In Example 5, we create another a copy of thuesen from which missing data have been removed
using the function na.omit. This function returns a data.frame with an
additional attribute, na.action, that lists which rows of the original
data.frame were removed.
## Example 5: na.omit
## Load the data.
library( package = ISwR )
data( thuesen )
## Copy the data.frame without the NA values.
## These three commands give the equivalent result.
thuesen4 <- na.omit( thuesen )
## Perform the analysis on the data subset.
thuesen4.lm <-
lm(
formula = short.velocity ~ blood.glucose,
data = thuesen4 )
## Show a summary of the coefficients.
options( show.signif.stars = FALSE )
summary( thuesen4.lm )
## Returns 23:
length( fitted( thuesen4.lm ) )
## Returns 23:
length( thuesen4$blood.glucose )
## Returns 23:
length( thuesen4.lm$model$blood.glucose )
## Returns TRUE:
length( fitted( thuesen4.lm ) ) == length( thuesen4$blood.glucose )
## Returns TRUE:
length( fitted( thuesen4.lm ) ) == length( thuesen4.lm$model$blood.glucose )
I have not decided which option I think is best. However, I am inclined to use the data stored in the model data.frame of the lm object rather than the original data in the data.frame for subsequent analysis.
Chapter 6
§ 6.1
Analyze artificial data for one-way layouts. In the first example, there is no relationship
between y and x. In the second example, there is a
relationship between y and x.
## Create artificial data in which there is no relationship between x and y.
n <- 100
x1 <- factor( c( rep( "A", n / 2 ), rep( "B", n / 2 ) ) )
y1 <- rnorm( n = n, mean = 0, sd = 1 )
xy1 <- data.frame( x = x1, y = y1 )
## Create artificial data in which there *is* a relationship between x and y
## (the mean is different for each level).
n <- 100
x2 <- factor( c( rep( "A", n / 2 ), rep( "B", n / 2 ) ) )
y2 <-
c(
rnorm( n = n / 2, mean = 0, sd = 1 ),
rnorm( n = n / 2, mean = 2, sd = 1 ) )
xy2 <- data.frame( x = x2, y = y2 )
## Plot on the default device.
## Boxplots are used by default.
## Use the same vertical scale for each plot.
get( getOption( "device" ) )()
par( mfrow = c( 1, 2 ) )
plot(
y ~ x,
data = xy1,
ylim = c( min( xy1$y, xy2$y ), max( xy1$y, xy2$y ) ) )
plot(
y ~ x,
data = xy2,
ylim = c( min( xy1$y, xy2$y ), max( xy1$y, xy2$y ) ) )
## Create and examine the models.
xy1.lm <- lm( y ~ x, data = xy1 )
summary( xy1.lm )
xy2.lm <- lm( y ~ x, data = xy2 )
summary( xy2.lm )
The output of the calls to summary summary() is:
> summary( xy1.lm )
Call:
lm(formula = y ~ x, data = xy1)
Residuals:
Min 1Q Median 3Q Max
-1.98474 -0.67045 0.06854 0.63067 2.03575
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009498 0.125063 0.076 0.940
xB 0.078829 0.176866 0.446 0.657
Residual standard error: 0.8843 on 98 degrees of freedom
Multiple R-Squared: 0.002023, Adjusted R-squared: -0.008161
F-statistic: 0.1986 on 1 and 98 DF, p-value: 0.6568
> summary( xy2.lm )
Call:
lm(formula = y ~ x, data = xy2)
Residuals:
Min 1Q Median 3Q Max
-2.171e+00 -6.399e-01 1.262e-05 5.791e-01 2.859e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.06881 0.13667 0.503 0.616
xB 1.89758 0.19329 9.817 3.01e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9664 on 98 degrees of freedom
Multiple R-Squared: 0.4958, Adjusted R-squared: 0.4907
F-statistic: 96.38 on 1 and 98 DF, p-value: 3.015e-16
The graphical output is:
Box plots of artificial one-way layouts
Analyze the red.cell.folate data set. I happen to favor box plots, but
I follow the book in adding a strip plot to the figure. I jitter the points, although
this is not necessary for this data set.
## Load the data.
library( package = "ISwR" )
data( red.cell.folate )
## Look at data summaries, including a plot (which is a boxplot by default).
## Follow the example in the book by adding a stripchart.
?red.cell.folate
summary( red.cell.folate )
get( getOption( "device" ) )()
plot( folate ~ ventilation, data = red.cell.folate )
stripchart(
x = red.cell.folate$folate ~ red.cell.folate$ventilation,
method = "jitter",
jitter = 0.05,
pch = 16,
vertical = TRUE,
add = TRUE,
col = "blue" )
## Create and examine the linear model.
rcf.lm <- lm( folate ~ ventilation, data = red.cell.folate )
anova( rcf.lm )
summary( rcf.lm )
The graphical output is:
Combined box and strip plots of red.cell.folate data
The example with the juul data set is easier if you don't attach the
data set first. Then it’s not necessary to detach and attach the data set after converting
juul$tanner to a factor object.
§ 6.2
Instead of using attach() to attach the data, an alternative is to use
with(). Here is my version of the Kruskal-Wallis example:
## Load the data.
library( package = "ISwR" )
data( red.cell.folate )
## Kruskal-Wallis rank sum test:
with(
data = red.cell.folate,
expr = kruskal.test( folate ~ ventilation ) )
§ 6.3
My version of the interaction.plot() example:
## Load the data.
library( package = "ISwR" )
data( heart.rate )
## Create an interaction plot.
with(
data = heart.rate,
expr = interaction.plot(
x.factor = time,
trace.factor = subj,
response = hr,
col = rainbow( n = 9 ) ) )
The output is:
Interaction plot of heart.rate data
Chapter 9
§ 9.1
It is always a good idea to take a quick look at the data. In addition to using the
pairs() function to quickly create pairwise scatter plots for all variables,
it is also useful to call summary() and cor().
summary(cystfibr) reveals that the range of the age variable
is only 7–23 years. cor() reveals that age,
height, and weight are highly positively correlated and
rv and frc are highly positively correlated. Correlations are
easier to view using an appropriate correlation plot.
## Load the data. library( package = ISwR ) data( cystfibr ) ## View the documentation for the cystfibr data. ?cystfibr ## Get quick summaries of the data. summary( cystfibr ) options( digits = 2 ) cor( x = cystfibr, method = "spearman" )
The output from summary() and cor() is:
> summary( cystfibr )
age sex height weight
Min. : 7.00 Min. :0.00 Min. :109.0 Min. :12.90
1st Qu.:11.00 1st Qu.:0.00 1st Qu.:139.0 1st Qu.:25.10
Median :14.00 Median :0.00 Median :156.0 Median :37.20
Mean :14.48 Mean :0.44 Mean :152.8 Mean :38.40
3rd Qu.:17.00 3rd Qu.:1.00 3rd Qu.:174.0 3rd Qu.:51.10
Max. :23.00 Max. :1.00 Max. :180.0 Max. :73.80
bmp fev1 rv frc
Min. :64.00 Min. :18.00 Min. :158.0 Min. :104.0
1st Qu.:68.00 1st Qu.:26.00 1st Qu.:188.0 1st Qu.:127.0
Median :71.00 Median :33.00 Median :225.0 Median :139.0
Mean :78.28 Mean :34.72 Mean :255.2 Mean :155.4
3rd Qu.:90.00 3rd Qu.:44.00 3rd Qu.:305.0 3rd Qu.:183.0
Max. :97.00 Max. :57.00 Max. :449.0 Max. :268.0
tlc pemax
Min. : 81 Min. : 65.0
1st Qu.:101 1st Qu.: 85.0
Median :113 Median : 95.0
Mean :114 Mean :109.1
3rd Qu.:128 3rd Qu.:130.0
Max. :147 Max. :195.0
> options( digits = 2 )
> cor( x = cystfibr, method = "spearman" )
age sex height weight bmp fev1 rv frc tlc pemax
age 1.00 -0.163 0.93 0.90 0.51 0.30 -0.58 -0.72 -0.493 0.52
sex -0.16 1.000 -0.23 -0.20 -0.15 -0.54 0.26 0.15 0.056 -0.26
height 0.93 -0.235 1.00 0.96 0.57 0.43 -0.62 -0.66 -0.473 0.59
weight 0.90 -0.201 0.96 1.00 0.73 0.46 -0.70 -0.67 -0.485 0.49
bmp 0.51 -0.146 0.57 0.73 1.00 0.56 -0.69 -0.55 -0.494 0.22
fev1 0.30 -0.542 0.43 0.46 0.56 1.00 -0.68 -0.60 -0.440 0.31
rv -0.58 0.257 -0.62 -0.70 -0.69 -0.68 1.00 0.85 0.589 -0.31
frc -0.72 0.151 -0.66 -0.67 -0.55 -0.60 0.85 1.00 0.672 -0.38
tlc -0.49 0.056 -0.47 -0.48 -0.49 -0.44 0.59 0.67 1.000 -0.15
pemax 0.52 -0.258 0.59 0.49 0.22 0.31 -0.31 -0.38 -0.148 1.00
For the pairwise scatter plots created by the pairs() function of R 2.2.1 for
Mac OS X, the default font size for the axes and the labels is a little too small to be
readable when using the quartz() device. Readability of very small fonts is
improved if text smoothing is allowed for font sizes greater than 4 in the
Appearance pane of Mac OS X System Preferences.
The book’s example sets the plot parameter par(mex=0.5), but in R 2.2.1 for
Mac OS X, this has no effect on the plot.
This code produces readable results similar to Figure 9.1:
## Load the data. library( package = ISwR ) data( cystfibr ) ## Pairwise scatter plots with larger labels: pairs( cystfibr, gap = 0, cex.labels = 1.1, cex.axis = 1.1 ) ## Equivalent plot: plot( cystfibr, gap = 0, cex.labels = 1.1, cex.axis = 1.1 )
§ 9.2
Rather than using attach() to attach the data, it is easy to specify
the data frame in the call to lm(). It is useful to save the results returned
by lm() into a variable, here cystfibr.lm. This example
turns off the display of significance stars.
## Load the data.
library( package = ISwR )
data( cystfibr )
## Create a linear model and examine the significance of the variables.
cystfibr.lm <-
lm(
pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc,
data = cystfibr )
options( show.signif.stars = FALSE )
summary( cystfibr.lm )
anova( cystfibr.lm )
It is important to note that summary() gives the significance of each variable
when it is added last to the model, whereas anova() gives the
significance of each variable as it is added in order to the model. That is, the order of
the variables in the model is important for the F tests produced by anova(),
with the F tests being successive for the addition of each of the variables to the model.
Partial F tests can be carried out using anova(), although the documentation
for anova() doesn't explain how.
## Partial F test for variable tlc:
cystfibr.anova <-
anova(
lm( pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc,
data = cystfibr ),
lm(
pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc,
data = cystfibr ) )
print( cystfibr.anova )
The output is:
Analysis of Variance Table
Model 1: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc
Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc +
tlc
Res.Df RSS Df Sum of Sq F Pr(>F)
1 16 9823.7
2 15 9731.2 1 92.4 0.1424 0.7112
Notice that the p-value, 0.7112, is identical to the p-value for the t test for the
coefficient of tlc produced by the summary(cystfibr.lm) function.
That is because the tests are equivalent.
The book demonstrates a partial F test for all variables after
age. The result shows that after age has been included in the
model, none of the other variables significantly improves the model.
## Partial F test for all variables after age has been included in the model.
cystfibr.lm1 <-
lm(
pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc,
data = cystfibr )
cystfibr.lm2 <-
lm(
pemax ~ age,
data = cystfibr )
anova( cystfibr.lm2, cystfibr.lm1 )
The p-values given by the summary() function are equivalent to the p-values
generated by a partial F test for each variable when that variable is added last to the
model.
The anova() function calculates Type I (sequential) tests. The
Anova() function in package car, written by
John Fox, computes Type II and Type III
tests. See ?Anova for more information. For voluminous discussion of
Type III tests, see RSiteSearch("Type III").
§ 9.3
The manual analysis of the significance of the variables in the linear model is instructive because it incorporates prior information.
The book mentions step() but doesn't use it. The step()
function returns the linear model that has the smallest AIC. Here we use
step() to create the linear model.
## Load the data.
library( package = ISwR )
data( cystfibr )
## Calculate the initial linear model by including all variables.
cystfibr.lm <-
lm(
formula = pemax ~ age + sex + height + weight + bmp + fev1 +
rv + frc + tlc,
data = cystfibr )
## Calculate a revised linear model using step().
cystfibr.lm.step <- step( object = cystfibr.lm, direction = "both" )
## Perform a partial F test on the revised model and the original model.
anova( cystfibr.lm.step, cystfibr.lm )
## View the coefficients of the revised model.
summary( cystfibr.lm.step )
The output from the step() function is:
> cystfibr.lm.step <- step( object = cystfibr.lm, direction = "both" )
Start: AIC= 169.11
pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc +
tlc
Df Sum of Sq RSS AIC
- sex 1 37.9 9769.2 167.2
- tlc 1 92.4 9823.7 167.3
- height 1 158.3 9889.6 167.5
- age 1 181.8 9913.1 167.6
- frc 1 254.6 9985.8 167.8
- fev1 1 648.4 10379.7 168.7
- rv 1 653.8 10385.0 168.7
<none> 9731.2 169.1
- weight 1 1441.2 11172.5 170.6
- bmp 1 1480.1 11211.4 170.6
Step: AIC= 167.2
pemax ~ age + height + weight + bmp + fev1 + rv + frc + tlc
Df Sum of Sq RSS AIC
- tlc 1 115.9 9885.1 165.5
- height 1 131.2 9900.4 165.5
- age 1 145.6 9914.7 165.6
- frc 1 221.5 9990.7 165.8
- rv 1 636.2 10405.3 166.8
<none> 9769.2 167.2
- weight 1 1446.2 11215.4 168.7
- bmp 1 1474.7 11243.9 168.7
+ sex 1 37.9 9731.2 169.1
- fev1 1 1770.4 11539.6 169.4
Step: AIC= 165.5
pemax ~ age + height + weight + bmp + fev1 + rv + frc
Df Sum of Sq RSS AIC
- frc 1 133.2 10018.3 163.8
- height 1 215.8 10100.9 164.0
- age 1 252.2 10137.3 164.1
- rv 1 543.5 10428.6 164.8
<none> 9885.1 165.5
+ tlc 1 115.9 9769.2 167.2
+ sex 1 61.4 9823.7 167.3
- fev1 1 1727.4 11612.5 167.5
- weight 1 2132.5 12017.6 168.4
- bmp 1 2354.3 12239.4 168.8
Step: AIC= 163.83
pemax ~ age + height + weight + bmp + fev1 + rv
Df Sum of Sq RSS AIC
- age 1 145.3 10163.6 162.2
- height 1 158.2 10176.5 162.2
- rv 1 568.1 10586.3 163.2
<none> 10018.3 163.8
+ frc 1 133.2 9885.1 165.5
+ tlc 1 27.6 9990.7 165.8
+ sex 1 0.04362 10018.2 165.8
- weight 1 2027.2 12045.5 166.4
- bmp 1 2324.1 12342.3 167.0
- fev1 1 2851.2 12869.5 168.1
Step: AIC= 162.19
pemax ~ height + weight + bmp + fev1 + rv
Df Sum of Sq RSS AIC
- height 1 191.0 10354.6 160.7
- rv 1 829.0 10992.6 162.2
<none> 10163.6 162.2
+ age 1 145.3 10018.3 163.8
+ tlc 1 102.3 10061.3 163.9
+ frc 1 26.3 10137.3 164.1
+ sex 1 0.6 10163.0 164.2
- weight 1 2603.5 12767.0 165.9
- bmp 1 2743.5 12907.1 166.2
- fev1 1 3210.9 13374.5 167.1
Step: AIC= 160.66
pemax ~ weight + bmp + fev1 + rv
Df Sum of Sq RSS AIC
<none> 10354.6 160.7
- rv 1 1183.6 11538.2 161.4
+ tlc 1 197.1 10157.5 162.2
+ height 1 191.0 10163.6 162.2
+ age 1 178.1 10176.5 162.2
+ frc 1 3.4 10351.2 162.6
+ sex 1 2.4 10352.2 162.7
- bmp 1 3072.6 13427.2 165.2
- fev1 1 3717.1 14071.7 166.3
- weight 1 10930.2 21284.8 176.7
The output from the anova() and summary() functions is:
> anova( cystfibr.lm.step, cystfibr.lm )
Analysis of Variance Table
Model 1: pemax ~ weight + bmp + fev1 + rv
Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc +
tlc
Res.Df RSS Df Sum of Sq F Pr(>F)
1 20 10354.6
2 15 9731.2 5 623.4 0.1922 0.9609
> summary( cystfibr.lm.step )
Call:
lm(formula = pemax ~ weight + bmp + fev1 + rv, data = cystfibr)
Residuals:
Min 1Q Median 3Q Max
-39.77 -11.74 4.33 15.66 35.07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.94669 53.27673 1.200 0.244057
weight 1.74891 0.38063 4.595 0.000175
bmp -1.37724 0.56534 -2.436 0.024322
fev1 1.54770 0.57761 2.679 0.014410
rv 0.12572 0.08315 1.512 0.146178
Residual standard error: 22.75 on 20 degrees of freedom
Multiple R-Squared: 0.6141, Adjusted R-squared: 0.5369
F-statistic: 7.957 on 4 and 20 DF, p-value: 0.000523
The book mentions that the model obtained from the original analysis of the data
contained variables weight, bmp, and fev1.
Analyze this model:
## Load the data.
library( package = ISwR )
data( cystfibr )
## Create and analyze the model.
cystfibr.lm <- lm( pemax ~ weight + bmp + fev1, data = cystfibr )
## Create the model with the lowest AIC.
cystfibr.lm.step <-
step(
object = cystfibr.lm,
scope = pemax ~ age + sex + height + weight + bmp + fev1 +
rv + frc + tlc,
direction = "both" )
## Compare the two models.
anova( cystfibr.lm, cystfibr.lm.step )
## Obtain a summary of the model.
summary( cystfibr.lm )
The step() function adds the rv term to the model. But as shown
in the output below, the partial F test suggests that the rv
term of the model is not significant:
> cystfibr.lm.step <-
+ step(
+ object = cystfibr.lm,
+ scope = pemax ~ age + sex + height + weight + bmp + fev1 +
+ rv + frc + tlc,
+ direction = "both" )
Start: AIC= 161.36
pemax ~ weight + bmp + fev1
Df Sum of Sq RSS AIC
+ rv 1 1183.6 10354.6 160.7
<none> 11538.2 161.4
+ frc 1 775.8 10762.4 161.6
+ tlc 1 658.1 10880.0 161.9
+ age 1 584.5 10953.7 162.1
+ height 1 545.6 10992.6 162.2
+ sex 1 2.6 11535.6 163.4
- fev1 1 2552.4 14090.5 164.4
- bmp 1 3515.9 15054.1 166.0
- weight 1 9766.9 21305.1 174.7
Step: AIC= 160.66
pemax ~ weight + bmp + fev1 + rv
Df Sum of Sq RSS AIC
<none> 10354.6 160.7
- rv 1 1183.6 11538.2 161.4
+ tlc 1 197.1 10157.5 162.2
+ height 1 191.0 10163.6 162.2
+ age 1 178.1 10176.5 162.2
+ frc 1 3.4 10351.2 162.6
+ sex 1 2.4 10352.2 162.7
- bmp 1 3072.6 13427.2 165.2
- fev1 1 3717.1 14071.7 166.3
- weight 1 10930.2 21284.8 176.7
> anova( cystfibr.lm, cystfibr.lm.step )
Analysis of Variance Table
Model 1: pemax ~ weight + bmp + fev1
Model 2: pemax ~ weight + bmp + fev1 + rv
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21 11538.2
2 20 10354.6 1 1183.6 2.2861 0.1462
> summary( cystfibr.lm )
Call:
lm(formula = pemax ~ weight + bmp + fev1, data = cystfibr)
Residuals:
Min 1Q Median 3Q Max
-42.388 -13.496 3.991 14.856 40.373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.3336 34.7199 3.639 0.001536
weight 1.5365 0.3644 4.216 0.000387
bmp -1.4654 0.5793 -2.530 0.019486
fev1 1.1086 0.5144 2.155 0.042893
Residual standard error: 23.44 on 21 degrees of freedom
Multiple R-Squared: 0.57, Adjusted R-squared: 0.5086
F-statistic: 9.279 on 3 and 21 DF, p-value: 0.0004180
§ 9.4
Problem 1
Both variables are highly significant in the model.
## Load the data. library( package = ISwR ) data( secher ) ## Take a quick look at the data. ## Note that column 4 of the data.frame, no, is the observation number, ## which is not used in any modeling. ?secher summary( secher[ , 1:3 ] ) get( getOption( "device" ) )() pairs( secher[ , 1:3 ], panel = panel.smooth, main = "secher" ) ## Use a log transformation of the variables, since these are size variables. ## The plots don't show that much difference after the transformation. get( getOption( "device" ) )() pairs( log( secher[ , 1:3 ] ), panel = panel.smooth, main = "log(secher)" ) ## Create and analyze a linear model based on log-transformed data. secher.lm <- lm( log( bwt ) ~ log( bpd ) + log( ad ), data = secher ) summary( secher.lm )
The output is:
pairs plot of secher data
pairs plot of log-transformed secher data
Call:
lm(formula = log(bwt) ~ log(bpd) + log(ad), data = secher)
Residuals:
Min 1Q Median 3Q Max
-0.350742 -0.067409 -0.007916 0.057502 0.363595
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.8615 0.6617 -8.859 2.36e-14
log(bpd) 1.5519 0.2294 6.764 8.09e-10
log(ad) 1.4667 0.1467 9.998 < 2e-16
Residual standard error: 0.1068 on 104 degrees of freedom
Multiple R-Squared: 0.8583, Adjusted R-squared: 0.8556
F-statistic: 314.9 on 2 and 104 DF, p-value: < 2.2e-16
Problem 4
Many data are missing for height and weight.
A subset must be created that can be used for both linear models.
## Load the data.
library( package = ISwR )
data( juul2 )
## Create a subset of the data.
juul2.25 <-
subset(
x = juul2,
subset = !is.na( height ) & !is.na( weight ) & age > 25 )
## Create the two linear models and perform the partial F test.
juul2.25.lm1 <-
lm(
formula = sqrt( igf1 ) ~ age,
data = juul2.25 )
juul2.25.lm2 <-
lm(
formula = sqrt( igf1 ) ~ age + height + weight,
data = juul2.25 )
anova( juul2.25.lm1, juul2.25.lm2 )
## View the results for the best model.
summary( juul2.25.lm1 )
The output is:
> anova( juul2.25.lm1, juul2.25.lm2 )
Analysis of Variance Table
Model 1: sqrt(igf1) ~ age
Model 2: sqrt(igf1) ~ age + height + weight
Res.Df RSS Df Sum of Sq F Pr(>F)
1 34 141.876
2 32 139.419 2 2.457 0.282 0.7561
> summary( juul2.25.lm1 )
Call:
lm(formula = sqrt(igf1) ~ age, data = juul2.25)
Residuals:
Min 1Q Median 3Q Max
-4.57566 -1.59678 -0.01542 1.64865 3.87918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.06696 1.45386 11.739 1.65e-13
age -0.06568 0.03315 -1.982 0.0557
Residual standard error: 2.043 on 34 degrees of freedom
Multiple R-Squared: 0.1035, Adjusted R-squared: 0.07717
F-statistic: 3.927 on 1 and 34 DF, p-value: 0.05565
An equivalent approach is to use the subset argument of lm().
## Load the data.
library( package = ISwR )
data( juul2 )
## Create the two linear models and perform the partial F test.
juul2.lm1 <-
lm(
formula = sqrt( igf1 ) ~ age,
subset = !is.na( height ) & !is.na( weight ) & age > 25,
data = juul2 )
juul2.lm2 <-
lm(
formula = sqrt( igf1 ) ~ age + height + weight,
subset = !is.na( height ) & !is.na( weight ) & age > 25,
data = juul2 )
anova( juul2.lm1, juul2.lm2 )
## View the results for the best model.
summary( juul2.lm1 )
We can simplify this code by using the update() function. The advantage
of this is that it clarifies the differences between the two models.
## Load the data.
library( package = ISwR )
data( juul2 )
## Create the two linear models and perform the partial F test.
juul2.lm1 <-
lm(
formula = sqrt( igf1 ) ~ age,
subset = !is.na( height ) & !is.na( weight ) & age > 25,
data = juul2 )
juul2.lm2 <- update( juul2.lm1, . ~ . + height + weight )
anova( juul2.lm1, juul2.lm2 )
## View the results for the best model.
summary( juul2.lm1 )
The output is:
> anova( juul2.lm1, juul2.lm2 )
Analysis of Variance Table
Model 1: sqrt(igf1) ~ age
Model 2: sqrt(igf1) ~ age + height + weight
Res.Df RSS Df Sum of Sq F Pr(>F)
1 34 141.876
2 32 139.419 2 2.457 0.282 0.7561
> summary( juul2.lm1 )
Call:
lm(formula = sqrt(igf1) ~ age, data = juul2, subset = !is.na(height) &
!is.na(weight) & age > 25)
Residuals:
Min 1Q Median 3Q Max
-4.57566 -1.59678 -0.01542 1.64865 3.87918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.06696 1.45386 11.739 1.65e-13 ***
age -0.06568 0.03315 -1.982 0.0557 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.043 on 34 degrees of freedom
Multiple R-Squared: 0.1035, Adjusted R-squared: 0.07717
F-statistic: 3.927 on 1 and 34 DF, p-value: 0.05565
Chapter 10
§ 10.1
This section returns to the cystfibr data set used in Chapter 9. A plot
of pemax against height suggests that the relationship between
these two variables is not linear. I have added a local regression curve to the plot
using loess().
## Load the data.
library( package = ISwR )
data( cystfibr )
## Plot the data and use loess() to fit a local regression curve.
get( getOption( "device" ) )()
plot( pemax ~ height, data = cystfibr )
cystfibr.loess <- loess( formula = pemax ~ height, data = cystfibr )
## Draw the local regression curve.
x <-
seq(
from = min( cystfibr$height ),
to = max( cystfibr$height ),
by = 2 )
y <-
predict(
object = cystfibr.loess,
newdata = data.frame( height = x ) )
lines( x = x, y = y, col = "red" )
The output is:
Plot of pemax vs. height for cystfibr data
When the results of the new model containing the quadratic term are plotted, the
plotting function matlines() is used. This is much more convenient than
calling lines() several times in succession.
My variation of the code is:
## Load the data.
library( package = ISwR )
data( cystfibr )
## Create the model.
pemax.lm <- lm( formula = pemax ~ height + I( height ^ 2 ), data = cystfibr )
## Create the predicted fitted values, confidence intervals of the mean, and
## prediction intervals.
pred.frame <-
data.frame(
height = seq(
from = min( cystfibr$height ),
to = max( cystfibr$height ),
by = 1 ) )
pc <- predict( object = pemax.lm, newdata = pred.frame, interval = "confidence" )
pp <- predict( object = pemax.lm, newdata = pred.frame, interval = "prediction" )
## Determine the limits for plotting on the y axis.
y.max = max( max( pc ), max( pp ), max( cystfibr$pemax ) )
y.min = min( min( pc ), min( pp ), min( cystfibr$pemax ) )
## Plot the data points.
get( getOption( "device" ) )()
plot( pemax ~ height, data = cystfibr, ylim = c( y.min, y.max ) )
## Add the predicted means, predicted confidence intervals of the mean, and
## prediction intervals. Since columns 1 of pp and pc are identical,
## containing the fitted values, omit plotting column 1 of pp.
matlines(
x = pred.frame$height,
y = pc,
lty = c( 1, 3, 3 ),
col = c( "black", "green4", "green4" ) )
matlines(
x = pred.frame$height,
y = pp[ , c( 2, 3 ) ], ## omit column 1
lty = c( 2, 2 ),
col = c( "blue", "blue" ) )
The output is:
Plot of pexmax ~ height + I( height ^ 2 ) for cystfibr data
§ 10.2
I have no additional notes for this section.
Chapter 11
There is a typographical error on p. 191; the logit function is:
logit p = log[ p/(1 – p)]
§ 11.1
For more information on glm() and on the family argument,
see:
?glm ?family
§ 11.2
## Enter the data:
no.yes <- c( "No", "Yes" )
smoking <- gl( n = 2, k = 1, length = 8, labels = no.yes )
obesity <- gl( n = 2, k = 2, length = 8, labels = no.yes )
snoring <- gl( n = 2, k = 4, length = 8, labels = no.yes )
n.tot <- c( 60, 17, 8, 2, 187, 85, 51, 23 )
n.hyp <- c( 5, 2, 1, 0, 35, 13, 15, 8 )
## General linear model, where the response is a matrix with one column
## is the number of "diseased" and the other column is the number of
## "healthy":
hyp.tbl <- cbind( n.hyp, n.tot - n.hyp )
hyp.tbl.glm <-
glm(
formula = hyp.tbl ~ smoking + obesity + snoring,
family = binomial( "logit" ) )
## General linear model where the response is a vector of the proportion of
## diseased in each cell:
hyp.prop <- n.hyp / n.tot
hyp.prop.glm <-
glm(
formula = hyp.prop ~ smoking + obesity + snoring,
family = binomial( "logit" ),
weights = n.tot )