R: Notes on Statistical Models in S

Introduction

Statistical Models in S was edited by John M. Chambers and Trevor J. Hastie.

The book’s chapters are:

  1. An appetizer
  2. Statistical models
  3. Data for models
  4. Linear models
  5. Analysis of variance; designed experiments
  6. Generalized linear models
  7. Generalized additive models
  8. Local regression models
  9. Tree-based models
  10. Nonlinear models

This page stores my notes as I work through the chapters of this book. I am using R 2.2.1 and R GUI 1.14 for Mac OS X.

Data sets

It is my understanding that the data sets used in this book are included with S-PLUS. I can't verify this since I don't have a copy of S-PLUS. Many of the data sets are available in various R packages.

car.test.frame

library( package = rpart )
?car.test.frame
data( car.test.frame )

cu.summary

library( package = rpart )
?cu.summary
data( cu.summary )

galaxy

library( package = ElemStatLearn )
?galaxy
data( galaxy )

market.survey

This data set is probably available from S-PLUS, but I haven't checked yet.

solder

The solder data set from package faraway contains 900 rows. The solder data set from package rpart contains 720 rows and is identical to the solder.balance data set from S-PLUS, except that some of the values of the row.names attribute are different.

Since the solder.balance data set is a subset of the original solder data set (p. 49), it is likely that the solder data set from package faraway is the original data set.

library( package = faraway )
?solder
data( solder )

solder.balance

One source for this data set is S-PLUS. This data set can be exported from S-PLUS with the following commands:

library( data )
write.table( solder.balance, file = "solder.balance.txt", sep = "\t" )

The data can be imported into R with the following commands:

solder.balance <-
    read.table(
        file      = "solder.balance.txt",
        sep       = "\t",
        header    = TRUE,
        row.names = 1 )

The data are identical to the solder data set from package rpart, except that some of the values of the row.names attribute are different.
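
A comparison that ignores row names can be sketched as follows. The two small data frames are made-up stand-ins for the S-PLUS export and the rpart copy, and same.ignoring.row.names() is a hypothetical helper written for this note, not a function from any package.

```r
##  Made-up stand-ins: same contents, different row names.
from.splus <- data.frame( Opening   = factor( c( "L", "M", "S" ) ),
                          skips     = c( 0, 3, 12 ),
                          row.names = c( "1", "2", "3" ) )
from.rpart <- from.splus
row.names( from.rpart ) <- c( "a", "b", "c" )

##  Hypothetical helper: reset the row names, then compare everything else.
same.ignoring.row.names <- function( a, b ) {
    row.names( a ) <- NULL
    row.names( b ) <- NULL
    isTRUE( all.equal( a, b ) )
}

##  The plain comparison reports the row-name difference;
##  the helper compares only the contents.
isTRUE( all.equal( from.splus, from.rpart ) )
same.ignoring.row.names( from.splus, from.rpart )
```

With the real data, the same helper could be applied to the imported solder.balance and to the solder data set from package rpart.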

Chapter 1 An Appetizer

§ 1.1

Some of the commands in R 2.2.1 work differently from the equivalent commands in S in 1991. In R, the default behavior of plot() on a data.frame of this type is to call pairs(), so the code below calls plot.design() explicitly.

##  Load the data. Define the column classes because the Panel column contains
##  numbers that are really factor levels. The leading NA leaves the row-names
##  column to the default type conversion.

solder.balance <-
    read.table(
        file       = "solder.balance.txt",
        sep        = "\t",
        header     = TRUE,
        row.names  = 1,
        colClasses = c( NA,       "factor", "factor", "factor",
                        "factor", "factor", "numeric" ) )

##  View a graphical summary of the relationship between the response variable
##  and the factor variables.

get( getOption( "device" ) )()
plot.design( solder.balance )

##  View box plots.

get( getOption( "device" ) )()
par( mfrow = c( 1, 2 ) )
plot( skips ~ Opening + Mask, data = solder.balance )
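
The effect of the colClasses argument used when loading solder.balance can be checked on a small file. This is only a sketch: the toy data frame, its values, and the temporary file are made up, and only the Panel and skips column names are borrowed from the real data.

```r
##  Toy stand-in for solder.balance.txt (made-up values).
toy      <- data.frame( Panel = c( 1, 2, 3, 1 ), skips = c( 0, 4, 2, 7 ) )
toy.file <- tempfile( fileext = ".txt" )
write.table( toy, file = toy.file, sep = "\t" )

##  Without colClasses, Panel is read back as a number.
plain <-
    read.table(
        file      = toy.file,
        sep       = "\t",
        header    = TRUE,
        row.names = 1 )

##  With colClasses, Panel is read back as a factor. There is one entry
##  per column in the file, including the row-names column (NA).
typed <-
    read.table(
        file       = toy.file,
        sep        = "\t",
        header     = TRUE,
        row.names  = 1,
        colClasses = c( NA, "factor", "numeric" ) )

class( plain$Panel )
class( typed$Panel )
levels( typed$Panel )
```

Reading Panel as a factor matters for plot.design() and for the box plots, which treat each level as a group instead of a position on a numeric axis.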

Chapter 2 Statistical Models

§ 2.2.3

At first, I thought that the function model.matrix() only extracts the design matrix from an object such as an lm object, but now I understand that model.matrix() creates the design matrix. The function is generic: the default method builds the matrix from a formula and data, and the method for lm objects returns the design matrix of a fitted model.

For example, using the trees data set in package datasets and following the example on the help page for model.matrix():

require( package = datasets )
ff <- log( Volume ) ~ log( Height ) + log( Girth )
mm <- model.matrix( object = ff, data = trees )
mm
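
model.matrix() also accepts a fitted model object. As a rough check, the sketch below builds the design matrix from the formula, then from an lm fit of the same formula, and compares the two. The names fit and mm.fit are introduced here only for illustration.

```r
require( package = datasets )

ff <- log( Volume ) ~ log( Height ) + log( Girth )

##  Design matrix built directly from the formula and the data.
mm <- model.matrix( object = ff, data = trees )

##  Design matrix taken from a fitted lm object.
fit    <- lm( formula = ff, data = trees )
mm.fit <- model.matrix( object = fit )

##  One row per tree, one column per coefficient (intercept + 2 terms).
dim( mm )
colnames( mm )

##  The two matrices should hold the same values.
all.equal( mm, mm.fit, check.attributes = FALSE )
```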

Chapter 3 Data for Models

§ 3.1.1

##  Load the cu.summary data into memory.

library( package = rpart )
data( cu.summary )

§ 3.1.2

##  Load the solder data into memory.

library( package = faraway )
?solder
data( solder )

§ 3.1.3

I have not found a source for the market.survey and market.frame data.

Chapter 8 Local Regression Models

§ 8.2.4

##  Load the galaxy data into memory.

library( package = ElemStatLearn )
?galaxy
data( galaxy )

##  Add random noise to the data before plotting it.
##  Note that the command for R 2.2.1 requires specification of the
##  amount argument in order to get results similar to what is shown
##  in the book.

ew.jittered <- jitter( galaxy$east.west,   factor = 1/2, amount = 0 )
ns.jittered <- jitter( galaxy$north.south, factor = 1/2, amount = 0 )
lim <- range( ew.jittered, ns.jittered )
plot( ew.jittered, ns.jittered, xlim = lim, ylim = lim )
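
The effect of the amount argument can be seen on a small made-up vector. In current versions of R, amount = 0 scales the noise to the range of the data (at most factor * z / 50, where z is the range), while the default amount = NULL scales it to the smallest gap between values (at most factor / 5 * d). That R 2.2.1 uses exactly the same rules is an assumption here.

```r
set.seed( 1 )

##  Made-up values: range z = 100, smallest gap d = 1.
x <- c( 0, 1, 2, 100 )

##  amount = 0: noise is bounded by factor * z / 50 = 1/2 * 100 / 50 = 1.
j.range <- jitter( x, factor = 1/2, amount = 0 )

##  amount = NULL (the default): noise is bounded by
##  factor / 5 * d = (1/2) / 5 * 1 = 0.1.
j.gap <- jitter( x, factor = 1/2 )

##  Largest perturbation under each rule.
max( abs( j.range - x ) )
max( abs( j.gap   - x ) )
```

For data such as galaxy, where the coordinates span a wide range but neighboring values can be close together, the two rules can give visibly different amounts of scatter.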