Lab1: Linear Models and ANOVA in R


Getting Started with R

Click on the R icon

You are now ready to continue. Let the games begin.


What is a data frame?

Data frames are a class of objects to represent data which are usually used in fitting models.
They share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.

R has a library of datasets available with which to experiment. One of these, Puromycin we will use in a later lab.

Data frames can be created by using the read.table(), data.frame(), or the as.data.frame() functions.

Suppose a table is recorded in a text file.
The read.table command creates a data frame from a file.

Here is the file puromycin.dat

> temp <- read.table("http://www.rohan.sdsu.edu/~babailey/stat700/puromycin.dat", header = F, col.names = c("conc", "vel", "state"))

The data frames temp and Puromycin should be identical (we'll see this in a future lab).

Note: if the file had the column headings, conc, vel, state, then you can use the header=T option.
The option skip=0 is the number of lines in the file to skip before reading data.


Fiting Linear Regression Models

The function lm() returns an R object called a fitted linear least-squares model object, or lm object for short:

lm(formula, data, ...)

For more information see the help function:

> help(lm)

We will use the R dataset LifeCycleSavings

> help(LifeCycleSavings)

We can plot the data:

> plot(LifeCycleSavings$pop15, LifeCycleSavings$sr)
> title("First LCSavings Plot")

Let's fit a simple linear regression model:

> lmfit <- lm(sr ~ pop15, data=LifeCycleSavings) 

What is the class of the object lmfit?

What is in the object lmfit? (the names function is useful)

You can get the summary and diagnostic plots by using the functions:

> summary(lmfit)
> par(mfrow=c(2,2)) #2x2 plotting region (there is also a layout() function)
> plot(lmfit)
Note: that if you do not want an intercept, then formula = sr ~ pop15 - 1

How did R calculate the standard errors for the estimates? There is a vcov.lm function

> vcov(lmfit)
> sqrt(diag(vcov(lmfit)))


Object Oriented Programming

From the help file of lm, see under See Also

What are the generic functions for lm and what do they do? To get an ANOVA table:

 
> anova(lmfit) 


Multiple Linear Regression Models

Let's fit the full linear model for our dataset.

> pairs(LifeCycleSavings, panel = panel.smooth,
           main = "LifeCycleSavings data")
> fm1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
> summary(fm1)

What about diagnostic plots? How well does the model fit the data?

To test whether any of the predictors have significance in the model, i.e., whether
b_1 = b_2 = b_3 = b_4 = 0. We can look at the p-value 0.00079, is so small the null hypothesis is rejected.


What about model selection?

> help(step)
> step(fm1)
What is AIC?

> help(AIC)
> help(extractAIC)


What about prediction?

There is a predict.lm function.


ANOVA

Recall, that above

To test whether any of the predictors have significance in the model, i.e., whether
b_1 = b_2 = b_3 = b_4 = 0. We can look at the p-value 0.00079, is so small the null hypothesis is rejected.

or directly,

> tss <- sum((LifeCycleSavings$sr - mean(LifeCycleSavings$sr))^2)
> rss <- deviance(fm1)
> df.residual(fm1)
> fstat <- ((tss-rss)/4)/(rss/df.residual(fm1))
> 1-pf(fstat,4,df.residual(fm1))

Test the null hypothesis that b_1 = 0, that pop15 is not significant and can be droped from the full model. (The test is given that all the other predictors are in the model.)

You could observe the t-value from the full model (p-value is 0.0026) and reject the null hypothesis. Or

A convenient way to compare two nested models is:

> m2 <- lm(sr ~ pop75 + dpi + ddpi, LifeCycleSavings)
> anova(m2, fm1)

See if you can compute the p-value directly and the t-based test and p-value.


Reference: Linear Models with R, by J.J. Faraway