Notes on Introductory Statistics with R

Table of contents

Introduction

Introductory Statistics with R (ISwR) was written by Peter Dalgaard. This is a fine book that is suitable for use in a course in applied multivariate statistics, such as Boston University’s CAS MA 684.

The book’s chapters are:

  1. Basics
  2. Probability and Distributions
  3. Descriptive statistics and graphics
  4. One- and two-sample tests
  5. Regression and correlation
  6. ANOVA and Kruskal-Wallis
  7. Tabular data
  8. Power and the computation of sample size
  9. Multiple regression
  10. Linear models
  11. Logistic regression
  12. Survival analysis

This page stores my notes as I work through the chapters of this book. I am using R 2.2.1 and R GUI 1.14 for Mac OS X.

The ISwR package

The ISwR package contains the data used in the examples and exercises of the book.

Installing ISwR

Here’s how to install the ISwR package using R for Mac OS X:

  1. In the R GUI, click on the Packages & Data menu and select Package Installer.
  2. In the R Package Installer window, click on the popup menu at the upper left of the window and select CRAN (binaries), then click on the Get List button.
  3. R may prompt you at this point to choose a CRAN mirror site.
  4. In the package list, select ISwR, then click on the Install Selected button at the lower right. R will download and install the ISwR package on your system. When the download is complete, the value in the Installed Version column is updated.
  5. The ISwR package requires the survival package. If the survival package is not already installed, install it the same way you installed ISwR.
  6. Close the R Package Installer window.

Loading ISwR

Here is the command for loading the ISwR package in an R session so you can use it:

library( package = ISwR )

For information about the contents of the ISwR package, enter the following command:

help( package = ISwR )

Chapter 1

§ 1.1

You can copy the code below and paste it directly into R to see the results. But you’ll get a better feel for what’s going on, and you’ll begin to internalize the R commands, by typing in the code yourself.

##  Create a vector containing weights.

weight <- c( 60, 72, 57, 90, 95, 72 )

##  Create a vector containing heights.

height <- c( 1.75, 1.80, 1.65, 1.90, 1.74, 1.91 )

##  Create a derived value, bmi, using vectorized math.

bmi <- weight / ( height ^ 2 )
print( bmi )

##  Calculate the mean of the weights and view it.

xbar <- sum( weight ) / length( weight )
print( xbar )

##  Calculate the standard deviation of the weights, using the traditional
##  formula, and view it.

stddev <- sqrt( sum( ( weight - xbar ) ^ 2 ) / ( length( weight ) - 1 ) )
print( stddev )

##  Calculate the standard deviation of the weights using a formula that
##  doesn't require pre-calculating the mean. (This formula is not presented
##  in ISwR.)

variance <-
    ( sum( weight ^ 2 ) - sum( weight ) ^ 2 / length( weight ) ) /
        ( length( weight ) - 1 )
stddev <- sqrt( variance )
print( stddev )

##  Calculate the mean of the weights using R's built-in function.

xbar <- mean( weight )
print( xbar )

##  Calculate the standard deviation of the weights using R's built-in function.

stddev <- sd( weight )

##  Carry out a t test and view the result.

bmi.t.test <- t.test( x = bmi, mu = 22.5, alternative = "two.sided" )
print( bmi.t.test )

##  Plot the points.

plot( x = height, y = weight, pch = 2 )

##  Calculate the coordinates of the predicted weights from the heights
##  using bmi = 22.5, and plot the curve by drawing lines between the
##  predicted points.

hh <- seq( from = 1.65, to = 1.90, by = 0.05 )  # heights
ww <- 22.5 * hh ^ 2                             # predicted weights
lines( x = hh, y = ww )

§ 1.2

The following objects, mostly functions, are described in this section:

I found indexing difficult to master, and it still confuses me at times. Here are some examples:

##  Create a vector and get the 5th item from it.

x <- seq( from = 3, to = 12, by = 1 )
x[ 5 ]  ## 7

##  Create a matrix and get the item in the 2nd row, 3rd column.

m <- as.matrix( 1:12, nrow = 3, ncol = 4, byrow = TRUE )
m
m[ 2, 3 ]  ## 7

##  Create a list with three components; the third component, z, is shorter.

ll <- list( x = 1:12, y = 4:15, z = c( "A", "B", "C" ) )
ll

##  Return a list containing the 2nd component of list ll.

ll[ 2 ]
data.class( ll[ 2 ] )  ##  "list"

##  Return the 2nd component of list ll. Since the 2nd component of the list ll
##  is a vector, this returns the vector.

ll[[ 2 ]]
data.class( ll[[ 2 ]] )  ##  "numeric"

##  Return a list containing the 2nd and 3rd components of list ll.

ll[ c( 2, 3 ) ]

##  Return component "y" of list ll.

ll$y  ##  Same result as ll[[ 2 ]]

##  Return the 3rd item of the vector that is the 2nd component of list ll.

ll[[ 2 ]][ 3 ]     ##  6
ll$y[ 3 ]          ##  identical result
ll[[ c( 2, 3 ) ]]  ##  identical result

##  A data.frame is a list where all components have equal length. This
##  makes it easy to extract elements at specific row and column positions.
##  Note what happens to component 3 when ll is converted to a data.frame.

d.f <- data.frame( ll )
d.f

##  Extract the element at row 3, column 2.

d.f[ 3, 2 ] ##  6

##  Return a data.frame containing component (column) 3 of d.f.

d.f[ 3 ]
data.class( d.f[ 3 ] )  ##  "data.frame"

##  Return a data.frame containing the elements of row 3 of d.f.

d.f[ 3, ]
data.class( d.f[ 3, ] )  ##  "data.frame"

##  Return a vector containing the elements of column 2 of d.f.

d.f[ , 2 ]
data.class( d.f[ , 2 ] )  ##  "numeric"

Conditional selection from data.frames is easy using the subset() function. A command that doesn't use subset() but returns the same result is also shown.

d.f <- data.frame( x = 1:12, y = 4:15, z = LETTERS[ 1:12 ] )

##  Return a data.frame containing the rows of d.f where the values of column 2
##  are greater than 8. These two commands return identical results.

subset( d.f, y > 8 )
d.f[ d.f$y > 8, ]

##  Return a data.frame where the values of column 2 are greater than 8 and
##  less than or equal to 10. These two commands return identical results.

subset( d.f, y > 8 & y <= 10 )
d.f[ d.f$y > 8 & d.f$y <= 10, ]

Here are some simple examples of using lapply(), sapply(), apply(), and tapply():

d.f <- data.frame( x = rnorm( n = 5 ), y = rpois( n = 5, lambda = 2 ) )

lapply( d.f, mean )  ##  column-wise (variable) means, returned as a list

sapply( d.f, mean )  ##  column-wise (variable) means, returned as a vector

apply( d.f, 1, min ) ##  row-wise min values, returned as a vector

d.f2 <-
    data.frame(
        x = rnorm( 10 ),
        y = rnorm( 10 ),
        z = rep( c( "A", "B" ), length = 10 ) )
d.f2
tapply( d.f2$x, d.f2$z, median )

§ 1.3

I have added the code for Figure 1.5, the layout of a standard plot, to my collection of scripts.

§ 1.4

This section on programming is too advanced for this point in the book, but it contains much useful information.

§ 1.5

I have no notes for this section.

§ 1.6

I usually use Perl to format my data before loading it into R. This means I rarely need to use anything other than read.table().

Chapter 5

§ 5.1

The example is easier to understand if a model object, thuesen.lm, is created and used in the subsequent steps. The model object contains the data and the results encapsulated within it; this is a powerful feature of the S programming language.

##  Load the data.

library( package = ISwR )
data( thuesen )

##  Perform the analysis.

thuesen.lm <-
    lm(
        formula = short.velocity ~ blood.glucose,
        data    = thuesen )

##  Show a summary of the coefficients.

options( show.signif.stars = FALSE )
summary( thuesen.lm )

##  View the analysis of variance table.

anova( thuesen.lm )

##  Plot the data with the regression line.
##  This creates Figure 5.1 of the book.

get( getOption( "device" ) )()
plot( short.velocity ~ blood.glucose, data = thuesen )
abline( thuesen.lm )

##  For didactic purposes, compute the p-value for the t test using the
##  observed t, 2.101, and 21 degrees of freedom.
##    pt( q = 2.101, df = 21 ) calculates P( t <= 2.101, df = 21 ).
##    1 - pt( q = 2.101, df = 21 ) calculates P( t >= 2.101, df = 21 ).
##    2 * ( 1 - pt( q = 2.101, df = 21 ) ) calculates
##      P( |t| >= 2.101, df = 21 ), a two-tail test.

p.value <- 2 * ( 1 - pt( q = 2.101, df = 21 ) )
print( p.value )

##  For didactic purposes, compute the p-value for the F test using the
##  observed F, 4.414, with 1 numerator degree of freedom and 21 denominator
##  degrees of freedom.

p.value <- 1 - pf( q = 4.414, df1 = 1, df2 = 21 )
print( p.value )

##  The F value is computed by the summary() function.
##  This example shows how to extract the F value and the degrees of
##  freedom to compute the p-value.

thuesen.lm.summary <- summary( thuesen.lm )

Prob <-
    pf(
        q   = thuesen.lm.summary$fstatistic[ "value" ],
        df1 = thuesen.lm.summary$fstatistic[ "numdf" ],
        df2 = thuesen.lm.summary$fstatistic[ "dendf" ] )

p.value <- 1 - Prob
print( p.value )

§ 5.2

There are many ways to handle missing data. I am bewildered by all the options. Here I amplify the book’s discussion of this problem.

Example 1 below handles missing data implicitly. The default behavior of lm() is to handle missing data based on the value of getOption("na.action"), the “factory-fresh” value of which is na.omit. If we don’t know that this option has the desired value, we can set it using the options() function before calling lm().

The result of handling missing data using na.omit is that the vector returned by fitted(thuesen.lm1) is one element shorter than the input vectors, thuesen$short.velocity and thuesen$blood.glucose.

Furthermore, the lm object’s copy of the data, here thuesen.lm1$model, is a data.frame with no rows containing NA values. We can avoid problems in unequal vector lengths by using the data stored within the lm object.

##  Example 1: implicit na.omit

##  Load the data.

library( package = ISwR )
data( thuesen )

##  Perform the analysis.

options( na.action = na.omit )
thuesen.lm1 <-
    lm(
        formula = short.velocity ~ blood.glucose,
        data    = thuesen )

##  Show a summary of the coefficients.

options( show.signif.stars = FALSE )
summary( thuesen.lm1 )

##  Returns 23:

length( fitted( thuesen.lm1 ) )

##  Returns 24:

length( thuesen$blood.glucose )

##  Returns FALSE:

length( fitted( thuesen.lm1 ) ) == length( thuesen$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen.lm1 ) ) == length( thuesen.lm1$model$blood.glucose )

Example 2 explicitly omits missing data by explicitly setting the using the na.action argument of the lm() function to na.omit. The vector lengths of the original data.frame are still longer than the vector returned by fitted().

##  Example 2: explicit na.omit

##  Load the data.

library( package = ISwR )
data( thuesen )

##  Perform the analysis. Specify the na.action argument.

thuesen.lm2 <-
    lm(
        formula   = short.velocity ~ blood.glucose,
        data      = thuesen,
        na.action = na.omit )

##  Show a summary of the coefficients.

options( show.signif.stars = FALSE )
summary( thuesen.lm2 )

##  Returns 23:

length( fitted( thuesen.lm2 ) )

##  Returns 24:

length( thuesen$blood.glucose )

##  Returns FALSE:

length( fitted( thuesen.lm2 ) ) == length( thuesen$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen.lm2 ) ) == length( thuesen.lm2$model$blood.glucose )

Example 3 explicitly excludes missing data by setting the na.action argument of the lm() function to na.exclude. Rows containing NA tracked by the lm object, but they are not preserved in the objects model data.frame. The result is that the length of the vector returned by fitted() is now the same length as the vectors from the original data.frame. Note that the value of fitted( thuesen.lm3 )[ 16 ] is NA.

##  Example 3: explicit na.exclude

##  Load the data.

library( package = ISwR )
data( thuesen )

##  Perform the analysis. Specify the na.action argument.

thuesen.lm3 <-
    lm(
        formula   = short.velocity ~ blood.glucose,
        data      = thuesen,
        na.action = na.exclude )

##  Show a summary of the coefficients.

options( show.signif.stars = FALSE )
summary( thuesen.lm3 )

##  Returns 24:

length( fitted( thuesen.lm3 ) )

##  Returns 23:

length( thuesen$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen.lm3 ) ) == length( thuesen$blood.glucose )

##  Returns FALSE:

length( fitted( thuesen.lm3 ) ) == length( thuesen.lm3$model$blood.glucose )

Example 4 handles missing data by creating a data.frame that contains no missing data. We create two copies of the thuesen data set from which we have removed rows containing NA values through the use of complete.cases(). Vector lengths of the input data.frame, here thuesen2, are 23, equal to the length of the vector produced by fitted().

##  Example 4: complete.cases

##  Load the data.

library( package = ISwR )
data( thuesen )

##  Copy the data.frame without the NA values.
##  These three commands give the equivalent result.

thuesen2 <- thuesen[ complete.cases( thuesen ), ]
thuesen3 <- subset( thuesen, complete.cases( thuesen ) )

##  Returns TRUE:

isTRUE( all.equal( thuesen2, thuesen3 ) )

##  Perform the analysis on the data subset, here thuesen2.

thuesen2.lm <-
    lm(
        formula = short.velocity ~ blood.glucose,
        data    = thuesen2 )

##  Show a summary of the coefficients.

options( show.signif.stars = FALSE )
summary( thuesen2.lm )

##  Returns 23:

length( fitted( thuesen2.lm ) )

##  Returns 23:

length( thuesen2$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen2.lm ) ) == length( thuesen2$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen2.lm ) ) == length( thuesen2.lm$model$blood.glucose )

In Example 5, we create another a copy of thuesen from which missing data have been removed using the function na.omit. This function returns a data.frame with an additional attribute, na.action, that lists which rows of the original data.frame were removed.

##  Example 5: na.omit

##  Load the data.

library( package = ISwR )
data( thuesen )

##  Copy the data.frame without the NA values.
##  These three commands give the equivalent result.

thuesen4 <- na.omit( thuesen )

##  Perform the analysis on the data subset.

thuesen4.lm <-
    lm(
        formula = short.velocity ~ blood.glucose,
        data    = thuesen4 )

##  Show a summary of the coefficients.

options( show.signif.stars = FALSE )
summary( thuesen4.lm )

##  Returns 23:

length( fitted( thuesen4.lm ) )

##  Returns 23:

length( thuesen4$blood.glucose )

##  Returns 23:

length( thuesen4.lm$model$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen4.lm ) ) == length( thuesen4$blood.glucose )

##  Returns TRUE:

length( fitted( thuesen4.lm ) ) == length( thuesen4.lm$model$blood.glucose )

I have not decided which option I think is best. However, I am inclined to use the data stored in the model data.frame of the lm object rather than the original data in the data.frame for subsequent analysis.

Chapter 6

§ 6.1

Analyze artificial data for one-way layouts. In the first example, there is no relationship between y and x. In the second example, there is a relationship between y and x.

##  Create artificial data in which there is no relationship between x and y.

n <- 100
x1 <- factor( c( rep( "A", n / 2 ), rep( "B", n / 2 ) ) )
y1 <- rnorm( n = n, mean = 0, sd = 1 )
xy1 <- data.frame( x = x1, y = y1 )

##  Create artificial data in which there *is* a relationship between x and y
##  (the mean is different for each level).

n <- 100
x2 <- factor( c( rep( "A", n / 2 ), rep( "B", n / 2 ) ) )
y2 <-
    c(
        rnorm( n = n / 2, mean = 0, sd = 1 ),
        rnorm( n = n / 2, mean = 2, sd = 1 ) )
xy2 <- data.frame( x = x2, y = y2 )

##  Plot on the default device.
##  Boxplots are used by default.
##  Use the same vertical scale for each plot.

get( getOption( "device" ) )()
par( mfrow = c( 1, 2 ) )

plot(
    y ~ x,
    data = xy1,
    ylim = c( min( xy1$y, xy2$y ), max( xy1$y, xy2$y ) ) )

plot(
    y ~ x,
    data = xy2,
    ylim = c( min( xy1$y, xy2$y ), max( xy1$y, xy2$y ) ) )

##  Create and examine the models.

xy1.lm <- lm( y ~ x, data = xy1 )
summary( xy1.lm )

xy2.lm <- lm( y ~ x, data = xy2 )
summary( xy2.lm )

The output of the calls to summary summary() is:

> summary( xy1.lm )

Call:
lm(formula = y ~ x, data = xy1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.98474 -0.67045  0.06854  0.63067  2.03575

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009498   0.125063   0.076    0.940
xB          0.078829   0.176866   0.446    0.657

Residual standard error: 0.8843 on 98 degrees of freedom
Multiple R-Squared: 0.002023,	Adjusted R-squared: -0.008161
F-statistic: 0.1986 on 1 and 98 DF,  p-value: 0.6568

> summary( xy2.lm )

Call:
lm(formula = y ~ x, data = xy2)

Residuals:
       Min         1Q     Median         3Q        Max
-2.171e+00 -6.399e-01  1.262e-05  5.791e-01  2.859e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06881    0.13667   0.503    0.616
xB           1.89758    0.19329   9.817 3.01e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9664 on 98 degrees of freedom
Multiple R-Squared: 0.4958,	Adjusted R-squared: 0.4907
F-statistic: 96.38 on 1 and 98 DF,  p-value: 3.015e-16

The graphical output is:

Box plots of artifical one-way layouts

Box plots of artificial one-way layouts

Analyze the red.cell.folate data set. I happen to favor box plots, but I follow the book in adding a strip plot to the figure. I jitter the points, although this is not necessary for this data set.

##  Load the data.

library( package = "ISwR" )
data( red.cell.folate )

##  Look at data summaries, including a plot (which is a boxplot by default).
##  Follow the example in the book by adding a stripchart.

?red.cell.folate
summary( red.cell.folate )
get( getOption( "device" ) )()
plot( folate ~ ventilation, data = red.cell.folate )
stripchart(
    x        = red.cell.folate$folate ~ red.cell.folate$ventilation,
    method   = "jitter",
    jitter   = 0.05,
    pch      = 16,
    vertical = TRUE,
    add      = TRUE,
    col      = "blue" )

##  Create and examine the linear model.

rcf.lm <- lm( folate ~ ventilation, data = red.cell.folate )
anova( rcf.lm )
summary( rcf.lm )

The graphical output is:

Combined box and strip plots of red.cell.folate data

Combined box and strip plots of red.cell.folate data

The example with the juul data set is easier if you don't attach the data set first. Then it’s not necessary to detach and attach the data set after converting juul$tanner to a factor object.

§ 6.2

Instead of using attach() to attach the data, an alternative is to use with(). Here is my version of the Kruskal-Wallis example:

##  Load the data.

library( package = "ISwR" )
data( red.cell.folate )

##  Kruskal-Wallis rank sum test:

with(
    data = red.cell.folate,
    expr = kruskal.test( folate ~ ventilation ) )

§ 6.3

My version of the interaction.plot() example:

##  Load the data.

library( package = "ISwR" )
data( heart.rate )

##  Create an interaction plot.

with(
    data = heart.rate,
    expr = interaction.plot(
               x.factor     = time,
               trace.factor = subj,
               response     = hr,
               col          = rainbow( n = 9 ) ) )

The output is:

Interaction plot of heart.rate data

Interaction plot of heart.rate data

Chapter 9

§ 9.1

It is always a good idea to take a quick look at the data. In addition to using the pairs() function to quickly create pairwise scatter plots for all variables, it is also useful to call summary() and cor().

summary(cystfibr) reveals that the range of the age variable is only 7–23 years. cor() reveals that age, height, and weight are highly positively correlated and rv and frc are highly positively correlated. Correlations are easier to view using an appropriate correlation plot.

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  View the documentation for the cystfibr data.

?cystfibr

##  Get quick summaries of the data.

summary( cystfibr )
options( digits = 2 )
cor( x = cystfibr, method = "spearman" )

The output from summary() and cor() is:

> summary( cystfibr )
      age             sex           height          weight
 Min.   : 7.00   Min.   :0.00   Min.   :109.0   Min.   :12.90
 1st Qu.:11.00   1st Qu.:0.00   1st Qu.:139.0   1st Qu.:25.10
 Median :14.00   Median :0.00   Median :156.0   Median :37.20
 Mean   :14.48   Mean   :0.44   Mean   :152.8   Mean   :38.40
 3rd Qu.:17.00   3rd Qu.:1.00   3rd Qu.:174.0   3rd Qu.:51.10
 Max.   :23.00   Max.   :1.00   Max.   :180.0   Max.   :73.80
      bmp             fev1             rv             frc
 Min.   :64.00   Min.   :18.00   Min.   :158.0   Min.   :104.0
 1st Qu.:68.00   1st Qu.:26.00   1st Qu.:188.0   1st Qu.:127.0
 Median :71.00   Median :33.00   Median :225.0   Median :139.0
 Mean   :78.28   Mean   :34.72   Mean   :255.2   Mean   :155.4
 3rd Qu.:90.00   3rd Qu.:44.00   3rd Qu.:305.0   3rd Qu.:183.0
 Max.   :97.00   Max.   :57.00   Max.   :449.0   Max.   :268.0
      tlc          pemax
 Min.   : 81   Min.   : 65.0
 1st Qu.:101   1st Qu.: 85.0
 Median :113   Median : 95.0
 Mean   :114   Mean   :109.1
 3rd Qu.:128   3rd Qu.:130.0
 Max.   :147   Max.   :195.0

> options( digits = 2 )
> cor( x = cystfibr, method = "spearman" )

         age    sex height weight   bmp  fev1    rv   frc    tlc pemax
age     1.00 -0.163   0.93   0.90  0.51  0.30 -0.58 -0.72 -0.493  0.52
sex    -0.16  1.000  -0.23  -0.20 -0.15 -0.54  0.26  0.15  0.056 -0.26
height  0.93 -0.235   1.00   0.96  0.57  0.43 -0.62 -0.66 -0.473  0.59
weight  0.90 -0.201   0.96   1.00  0.73  0.46 -0.70 -0.67 -0.485  0.49
bmp     0.51 -0.146   0.57   0.73  1.00  0.56 -0.69 -0.55 -0.494  0.22
fev1    0.30 -0.542   0.43   0.46  0.56  1.00 -0.68 -0.60 -0.440  0.31
rv     -0.58  0.257  -0.62  -0.70 -0.69 -0.68  1.00  0.85  0.589 -0.31
frc    -0.72  0.151  -0.66  -0.67 -0.55 -0.60  0.85  1.00  0.672 -0.38
tlc    -0.49  0.056  -0.47  -0.48 -0.49 -0.44  0.59  0.67  1.000 -0.15
pemax   0.52 -0.258   0.59   0.49  0.22  0.31 -0.31 -0.38 -0.148  1.00

For the pairwise scatter plots created by the pairs() function of R 2.2.1 for Mac OS X, the default font size for the axes and the labels is a little too small to be readable when using the quartz() device. Readability of very small fonts is improved if text smoothing is allowed for font sizes greater than 4 in the Appearance pane of Mac OS X System Preferences.

The book’s example sets the plot parameter par(mex=0.5), but in R 2.2.1 for Mac OS X, this has no effect on the plot.

This code produces readable results similar to Figure 9.1:

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  Pairwise scatter plots with larger labels:

pairs( cystfibr, gap = 0, cex.labels = 1.1, cex.axis = 1.1 )

##  Equivalent plot:

plot( cystfibr, gap = 0, cex.labels = 1.1, cex.axis = 1.1 )

§ 9.2

Rather than using attach() to attach the data, it is easy to specify the data frame in the call to lm(). It is useful to save the results returned by lm() into a variable, here cystfibr.lm. This example turns off the display of significance stars.

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  Create a linear model and examine the significance of the variables.

cystfibr.lm <-
    lm(
        pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc,
        data = cystfibr )
options( show.signif.stars = FALSE )
summary( cystfibr.lm )
anova( cystfibr.lm )

It is important to note that summary() gives the significance of each variable when it is added last to the model, whereas anova() gives the significance of each variable as it is added in order to the model. That is, the order of the variables in the model is important for the F tests produced by anova(), with the F tests being successive for the addition of each of the variables to the model.

Partial F tests can be carried out using anova(), although the documentation for anova() doesn't explain how.

##  Partial F test for variable tlc:

cystfibr.anova <-
    anova(
        lm( pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc,
            data = cystfibr ),
        lm(
            pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc,
            data = cystfibr ) )

print( cystfibr.anova )

The output is:

Analysis of Variance Table

Model 1: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc
Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc +
    tlc
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     16 9823.7
2     15 9731.2  1      92.4 0.1424 0.7112

Notice that the p-value, 0.7112, is identical to the p-value for the t test for the coefficient of tlc produced by the summary(cystfibr.lm) function. That is because the tests are equivalent.

The book demonstrates a partial F test for all variables after age. The result shows that after age has been included in the model, none of the other variables significantly improves the model.

##  Partial F test for all variables after age has been included in the model.

cystfibr.lm1 <-
    lm(
        pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc,
        data = cystfibr )
cystfibr.lm2 <-
    lm(
        pemax ~ age,
        data = cystfibr )
anova( cystfibr.lm2, cystfibr.lm1 )

The p-values given by the summary() function are equivalent to the p-values generated by a partial F test for each variable when that variable is added last to the model.

The anova() function calculates Type I (sequential) tests. The Anova() function in package car, written by John Fox, computes Type II and Type III tests. See ?Anova for more information. For voluminous discussion of Type III tests, see RSiteSearch("Type III").

§ 9.3

The manual analysis of the significance of the variables in the linear model is instructive because it incorporates prior information.

The book mentions step() but doesn't use it. The step() function returns the linear model that has the smallest AIC. Here we use step() to create the linear model.

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  Calculate the initial linear model by including all variables.

cystfibr.lm <-
    lm(
        formula = pemax ~ age + sex + height + weight + bmp + fev1 +
                  rv + frc + tlc,
        data    = cystfibr )

##  Calculate a revised linear model using step().

cystfibr.lm.step <- step( object = cystfibr.lm, direction = "both" )

##  Perform a partial F test on the revised model and the original model.

anova( cystfibr.lm.step, cystfibr.lm )

##  View the coefficients of the revised model.

summary( cystfibr.lm.step )

The output from the step() function is:

> cystfibr.lm.step <- step( object = cystfibr.lm, direction = "both" )

Start:  AIC= 169.11
 pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc +
    tlc

         Df Sum of Sq     RSS     AIC
- sex     1      37.9  9769.2   167.2
- tlc     1      92.4  9823.7   167.3
- height  1     158.3  9889.6   167.5
- age     1     181.8  9913.1   167.6
- frc     1     254.6  9985.8   167.8
- fev1    1     648.4 10379.7   168.7
- rv      1     653.8 10385.0   168.7
<none>                 9731.2   169.1
- weight  1    1441.2 11172.5   170.6
- bmp     1    1480.1 11211.4   170.6

Step:  AIC= 167.2
 pemax ~ age + height + weight + bmp + fev1 + rv + frc + tlc

         Df Sum of Sq     RSS     AIC
- tlc     1     115.9  9885.1   165.5
- height  1     131.2  9900.4   165.5
- age     1     145.6  9914.7   165.6
- frc     1     221.5  9990.7   165.8
- rv      1     636.2 10405.3   166.8
<none>                 9769.2   167.2
- weight  1    1446.2 11215.4   168.7
- bmp     1    1474.7 11243.9   168.7
+ sex     1      37.9  9731.2   169.1
- fev1    1    1770.4 11539.6   169.4

Step:  AIC= 165.5
 pemax ~ age + height + weight + bmp + fev1 + rv + frc

         Df Sum of Sq     RSS     AIC
- frc     1     133.2 10018.3   163.8
- height  1     215.8 10100.9   164.0
- age     1     252.2 10137.3   164.1
- rv      1     543.5 10428.6   164.8
<none>                 9885.1   165.5
+ tlc     1     115.9  9769.2   167.2
+ sex     1      61.4  9823.7   167.3
- fev1    1    1727.4 11612.5   167.5
- weight  1    2132.5 12017.6   168.4
- bmp     1    2354.3 12239.4   168.8

Step:  AIC= 163.83
 pemax ~ age + height + weight + bmp + fev1 + rv

         Df Sum of Sq     RSS     AIC
- age     1     145.3 10163.6   162.2
- height  1     158.2 10176.5   162.2
- rv      1     568.1 10586.3   163.2
<none>                10018.3   163.8
+ frc     1     133.2  9885.1   165.5
+ tlc     1      27.6  9990.7   165.8
+ sex     1   0.04362 10018.2   165.8
- weight  1    2027.2 12045.5   166.4
- bmp     1    2324.1 12342.3   167.0
- fev1    1    2851.2 12869.5   168.1

Step:  AIC= 162.19
 pemax ~ height + weight + bmp + fev1 + rv

         Df Sum of Sq     RSS     AIC
- height  1     191.0 10354.6   160.7
- rv      1     829.0 10992.6   162.2
<none>                10163.6   162.2
+ age     1     145.3 10018.3   163.8
+ tlc     1     102.3 10061.3   163.9
+ frc     1      26.3 10137.3   164.1
+ sex     1       0.6 10163.0   164.2
- weight  1    2603.5 12767.0   165.9
- bmp     1    2743.5 12907.1   166.2
- fev1    1    3210.9 13374.5   167.1

Step:  AIC= 160.66
 pemax ~ weight + bmp + fev1 + rv

         Df Sum of Sq     RSS     AIC
<none>                10354.6   160.7
- rv      1    1183.6 11538.2   161.4
+ tlc     1     197.1 10157.5   162.2
+ height  1     191.0 10163.6   162.2
+ age     1     178.1 10176.5   162.2
+ frc     1       3.4 10351.2   162.6
+ sex     1       2.4 10352.2   162.7
- bmp     1    3072.6 13427.2   165.2
- fev1    1    3717.1 14071.7   166.3
- weight  1   10930.2 21284.8   176.7

The output from the anova() and summary() functions is:

> anova( cystfibr.lm.step, cystfibr.lm )

Analysis of Variance Table

Model 1: pemax ~ weight + bmp + fev1 + rv
Model 2: pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc +
    tlc
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     20 10354.6
2     15  9731.2  5     623.4 0.1922 0.9609

> summary( cystfibr.lm.step )

Call:
lm(formula = pemax ~ weight + bmp + fev1 + rv, data = cystfibr)

Residuals:
   Min     1Q Median     3Q    Max
-39.77 -11.74   4.33  15.66  35.07

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.94669   53.27673   1.200 0.244057
weight       1.74891    0.38063   4.595 0.000175
bmp         -1.37724    0.56534  -2.436 0.024322
fev1         1.54770    0.57761   2.679 0.014410
rv           0.12572    0.08315   1.512 0.146178

Residual standard error: 22.75 on 20 degrees of freedom
Multiple R-Squared: 0.6141,	Adjusted R-squared: 0.5369
F-statistic: 7.957 on 4 and 20 DF,  p-value: 0.000523

The book mentions that the model obtained from the original analysis of the data contained variables weight, bmp, and fev1. Analyze this model:

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  Create and analyze the model.

cystfibr.lm <- lm( pemax ~ weight + bmp + fev1, data = cystfibr )

##  Create the model with the lowest AIC.

cystfibr.lm.step <-
    step(
        object    = cystfibr.lm,
        scope     = pemax ~ age + sex + height + weight + bmp + fev1 +
                    rv + frc + tlc,
        direction = "both" )

##  Compare the two models.

anova( cystfibr.lm, cystfibr.lm.step )

##  Obtain a summary of the model.

summary( cystfibr.lm )

The step() function adds the rv term to the model. But as shown in the output below, the partial F test suggests that the rv term of the model is not significant:

> cystfibr.lm.step <-
+     step(
+         object    = cystfibr.lm,
+         scope     = pemax ~ age + sex + height + weight + bmp + fev1 +
+                     rv + frc + tlc,
+         direction = "both" )

Start:  AIC= 161.36
 pemax ~ weight + bmp + fev1

         Df Sum of Sq     RSS     AIC
+ rv      1    1183.6 10354.6   160.7
<none>                11538.2   161.4
+ frc     1     775.8 10762.4   161.6
+ tlc     1     658.1 10880.0   161.9
+ age     1     584.5 10953.7   162.1
+ height  1     545.6 10992.6   162.2
+ sex     1       2.6 11535.6   163.4
- fev1    1    2552.4 14090.5   164.4
- bmp     1    3515.9 15054.1   166.0
- weight  1    9766.9 21305.1   174.7

Step:  AIC= 160.66
 pemax ~ weight + bmp + fev1 + rv

         Df Sum of Sq     RSS     AIC
<none>                10354.6   160.7
- rv      1    1183.6 11538.2   161.4
+ tlc     1     197.1 10157.5   162.2
+ height  1     191.0 10163.6   162.2
+ age     1     178.1 10176.5   162.2
+ frc     1       3.4 10351.2   162.6
+ sex     1       2.4 10352.2   162.7
- bmp     1    3072.6 13427.2   165.2
- fev1    1    3717.1 14071.7   166.3
- weight  1   10930.2 21284.8   176.7

> anova( cystfibr.lm, cystfibr.lm.step )

Analysis of Variance Table

Model 1: pemax ~ weight + bmp + fev1
Model 2: pemax ~ weight + bmp + fev1 + rv
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     21 11538.2
2     20 10354.6  1    1183.6 2.2861 0.1462

> summary( cystfibr.lm )

Call:
lm(formula = pemax ~ weight + bmp + fev1, data = cystfibr)

Residuals:
    Min      1Q  Median      3Q     Max
-42.388 -13.496   3.991  14.856  40.373

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.3336    34.7199   3.639 0.001536
weight        1.5365     0.3644   4.216 0.000387
bmp          -1.4654     0.5793  -2.530 0.019486
fev1          1.1086     0.5144   2.155 0.042893

Residual standard error: 23.44 on 21 degrees of freedom
Multiple R-Squared:  0.57,	Adjusted R-squared: 0.5086
F-statistic: 9.279 on 3 and 21 DF,  p-value: 0.0004180

§ 9.4

Problem 1

Both variables are highly significant in the model.

##  Load the data.

library( package = ISwR )
data( secher )

##  Take a quick look at the data.
##  Note that column 4 of the data.frame, no, is the observation number,
##  which is not used in any modeling.

?secher
summary( secher[ , 1:3 ] )
get( getOption( "device" ) )()
pairs( secher[ , 1:3 ], panel = panel.smooth, main = "secher" )

##  Use a log transformation of the variables, since these are size variables.
##  The plots don't show that much difference after the transformation.

get( getOption( "device" ) )()
pairs( log( secher[ , 1:3 ] ), panel = panel.smooth, main = "log(secher)" )

##  Create and analyze a linear model based on log-transformed data.

secher.lm <- lm( log( bwt ) ~ log( bpd ) + log( ad ), data = secher )
summary( secher.lm )

The output is:

pairs plot of secher data

pairs plot of secher data

pairs plot of log-transformed secher data

pairs plot of log-transformed secher data

Call:
lm(formula = log(bwt) ~ log(bpd) + log(ad), data = secher)

Residuals:
      Min        1Q    Median        3Q       Max
-0.350742 -0.067409 -0.007916  0.057502  0.363595

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5.8615     0.6617  -8.859 2.36e-14
log(bpd)      1.5519     0.2294   6.764 8.09e-10
log(ad)       1.4667     0.1467   9.998  < 2e-16

Residual standard error: 0.1068 on 104 degrees of freedom
Multiple R-Squared: 0.8583,	Adjusted R-squared: 0.8556
F-statistic: 314.9 on 2 and 104 DF,  p-value: < 2.2e-16

Problem 4

Many data are missing for height and weight. A subset must be created that can be used for both linear models.

##  Load the data.

library( package = ISwR )
data( juul2 )

##  Create a subset of the data.

juul2.25 <-
    subset(
        x      = juul2,
        subset = !is.na( height ) & !is.na( weight ) & age > 25 )

##  Create the two linear models and perform the partial F test.

juul2.25.lm1 <-
    lm(
        formula = sqrt( igf1 ) ~ age,
        data    = juul2.25 )

juul2.25.lm2 <-
    lm(
        formula = sqrt( igf1 ) ~ age + height + weight,
        data    = juul2.25 )

anova( juul2.25.lm1, juul2.25.lm2 )

##  View the results for the best model.

summary( juul2.25.lm1 )

The output is:

> anova( juul2.25.lm1, juul2.25.lm2 )
Analysis of Variance Table

Model 1: sqrt(igf1) ~ age
Model 2: sqrt(igf1) ~ age + height + weight
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1     34 141.876
2     32 139.419  2     2.457 0.282 0.7561

> summary( juul2.25.lm1 )

Call:
lm(formula = sqrt(igf1) ~ age, data = juul2.25)

Residuals:
     Min       1Q   Median       3Q      Max
-4.57566 -1.59678 -0.01542  1.64865  3.87918

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.06696    1.45386  11.739 1.65e-13
age         -0.06568    0.03315  -1.982   0.0557

Residual standard error: 2.043 on 34 degrees of freedom
Multiple R-Squared: 0.1035,	Adjusted R-squared: 0.07717
F-statistic: 3.927 on 1 and 34 DF,  p-value: 0.05565

An equivalent approach is to use the subset argument of lm().

##  Load the data.

library( package = ISwR )
data( juul2 )

##  Create the two linear models and perform the partial F test.

juul2.lm1 <-
    lm(
        formula = sqrt( igf1 ) ~ age,
        subset  = !is.na( height ) & !is.na( weight ) & age > 25,
        data    = juul2 )

juul2.lm2 <-
    lm(
        formula = sqrt( igf1 ) ~ age + height + weight,
        subset  = !is.na( height ) & !is.na( weight ) & age > 25,
        data    = juul2 )

anova( juul2.lm1, juul2.lm2 )

##  View the results for the best model.

summary( juul2.lm1 )

We can simplify this code by using the update() function. The advantage of this is that it clarifies the differences between the two models.

##  Load the data.

library( package = ISwR )
data( juul2 )

##  Create the two linear models and perform the partial F test.

juul2.lm1 <-
    lm(
        formula = sqrt( igf1 ) ~ age,
        subset  = !is.na( height ) & !is.na( weight ) & age > 25,
        data    = juul2 )

juul2.lm2 <- update( juul2.lm1, . ~ . + height + weight )

anova( juul2.lm1, juul2.lm2 )

##  View the results for the best model.

summary( juul2.lm1 )

The output is:

> anova( juul2.lm1, juul2.lm2 )
Analysis of Variance Table

Model 1: sqrt(igf1) ~ age
Model 2: sqrt(igf1) ~ age + height + weight
  Res.Df     RSS Df Sum of Sq     F Pr(>F)
1     34 141.876
2     32 139.419  2     2.457 0.282 0.7561

> summary( juul2.lm1 )

Call:
lm(formula = sqrt(igf1) ~ age, data = juul2, subset = !is.na(height) &
    !is.na(weight) & age > 25)

Residuals:
     Min       1Q   Median       3Q      Max
-4.57566 -1.59678 -0.01542  1.64865  3.87918

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.06696    1.45386  11.739 1.65e-13 ***
age         -0.06568    0.03315  -1.982   0.0557 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.043 on 34 degrees of freedom
Multiple R-Squared: 0.1035,	Adjusted R-squared: 0.07717
F-statistic: 3.927 on 1 and 34 DF,  p-value: 0.05565

Chapter 10

§ 10.1

This section returns to the cystfibr data set used in Chapter 9. A plot of pemax against height suggests that the relationship between these two variables is not linear. I have added a local regression curve to the plot using loess().

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  Plot the data and use loess() to fit a local regression curve.

get( getOption( "device" ) )()
plot( pemax ~ height, data = cystfibr )
cystfibr.loess <- loess( formula = pemax ~ height, data = cystfibr )

##  Draw the local regression curve.

x <-
    seq(
        from = min( cystfibr$height ),
        to   = max( cystfibr$height ),
        by   = 2 )

y <-
    predict(
        object   = cystfibr.loess,
        newdata  = data.frame( height = x ) )

lines( x = x, y = y, col = "red" )

The output is:

Plot of pemax vs. height for cystfibr data

Plot of pemax vs. height for cystfibr data

When the results of the new model containing the quadratic term are plotted, the plotting function matlines() is used. This is much more convenient than calling lines() several times in succession.

My variation of the code is:

##  Load the data.

library( package = ISwR )
data( cystfibr )

##  Create the model.

pemax.lm <- lm( formula = pemax ~ height + I( height ^ 2 ), data = cystfibr )

##  Create the predicted fitted values, confidence intervals of the mean, and
##  prediction intervals.

pred.frame <-
    data.frame(
        height = seq(
                     from = min( cystfibr$height ),
                     to   = max( cystfibr$height ),
                     by   = 1 ) )

pc <- predict( object = pemax.lm, newdata = pred.frame, interval = "confidence" )

pp <- predict( object = pemax.lm, newdata = pred.frame, interval = "prediction" )

##  Determine the limits for plotting on the y axis.

y.max = max( max( pc ), max( pp ), max( cystfibr$pemax ) )
y.min = min( min( pc ), min( pp ), min( cystfibr$pemax ) )

##  Plot the data points.

get( getOption( "device" ) )()
plot( pemax ~ height, data = cystfibr, ylim = c( y.min, y.max ) )

##  Add the predicted means, predicted confidence intervals of the mean, and
##  prediction intervals. Since columns 1 of pp and pc are identical,
##  containing the fitted values, omit plotting column 1 of pp.

matlines(
    x   = pred.frame$height,
    y   = pc,
    lty = c( 1, 3, 3 ),
    col = c( "black", "green4", "green4" ) )

matlines(
    x   = pred.frame$height,
    y   = pp[ , c( 2, 3 ) ],  ##  omit column 1
    lty = c( 2, 2 ),
    col = c( "blue", "blue" ) )

The output is:

Plot of pemax ~ height + I( height ^ 2 ) for cystfibr data

Plot of pexmax ~ height + I( height ^ 2 ) for cystfibr data

§ 10.2

I have no additional notes for this section.

Chapter 11

There is a typographical error on p. 191; the logit function is:

logit p = log[ p/(1 – p)]

§ 11.1

For more information on glm() and on the family argument, see:

?glm
?family

§ 11.2

##  Enter the data:

no.yes <- c( "No", "Yes" )
smoking <- gl( n = 2, k = 1, length = 8, labels = no.yes )
obesity <- gl( n = 2, k = 2, length = 8, labels = no.yes )
snoring <- gl( n = 2, k = 4, length = 8, labels = no.yes )
n.tot <- c(  60,  17,   8,   2, 187,  85,  51,  23 )
n.hyp <- c(   5,   2,   1,   0,  35,  13,  15,   8 )

##  General linear model, where the response is a matrix with one column
##  is the number of "diseased" and the other column is the number of
##  "healthy":

hyp.tbl <- cbind( n.hyp, n.tot - n.hyp )
hyp.tbl.glm <-
    glm(
        formula = hyp.tbl ~ smoking + obesity + snoring,
        family  = binomial( "logit" ) )

##  General linear model where the response is a vector of the proportion of
##  diseased in each cell:

hyp.prop <- n.hyp / n.tot
hyp.prop.glm <-
    glm(
        formula = hyp.prop ~ smoking + obesity + snoring,
        family  = binomial( "logit" ),
        weights = n.tot )