To follow along, use the lab10 code.

Now load the packages

library("tidyverse")
library("Sleuth3")

ANOVA Table

To run an ANOVA in R, use the anova() function on an lm object.

m <- lm(Lifetime ~ Diet, data = case0501)
anova(m)
## Analysis of Variance Table
## 
## Response: Lifetime
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Diet        5  12734  2546.8  57.104 < 2.2e-16 ***
## Residuals 343  15297    44.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note a few things about this table:

This F-test also shows up in the R regression output:

summary(m)
## 
## Call:
## lm(formula = Lifetime ~ Diet, data = case0501)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
## DietN/R40    12.4254     1.2352  10.059  < 2e-16 ***
## DietN/R50     9.6060     1.1877   8.088 1.06e-14 ***
## DietNP       -5.2892     1.3010  -4.065 5.95e-05 ***
## DietR/R50    10.1945     1.2565   8.113 8.88e-15 ***
## Dietlopro     6.9945     1.2565   5.567 5.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

where the last line provides the F-statistic, the numerator and denominator degrees of freedom, and the associated p-value. The two models being compared here are the intercept-only model and the model with all explanatory variables (in this case the indicator functions for all but one Diet).

This F-test can be constructed manually by fitting these two models and using the anova() function to compare the models.

m0 <- lm(Lifetime ~ 1, data = case0501) # 1 indicates intercept 
anova(m0, m)
## Analysis of Variance Table
## 
## Model 1: Lifetime ~ 1
## Model 2: Lifetime ~ Diet
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    348 28031                                  
## 2    343 15297  5     12734 57.104 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This table contains the information from the intercept-only model in row 1 and the information from the model with Diet, as a categorical variable, in row 2.

Activity

The ex0920 data set contains information on the Kentucky Derby winners from 1896-2011.

  1. Fit a multiple regression model of Speed on Starters and the 7 level version of Track conditions.
  2. Verify that the F-statistic in the summary() output of this multiple regression model compares the intercept-only model with the model with all of the explanatory variables.

Bonus

  1. After completing this, take a look at the ANOVA table on the regression model and see if you can figure out what the different lines represent.
Click for solution
m0 <- lm(Speed ~ 1,                data = ex0920)
m  <- lm(Speed ~ Starters + Track, data = ex0920)
summary(m)
## 
## Call:
## lm(formula = Speed ~ Starters + Track, data = ex0920)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93774 -0.27403  0.00487  0.39355  1.22224 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.78998    0.60750  57.268  < 2e-16 ***
## Starters     0.05500    0.01172   4.695 7.88e-06 ***
## TrackFast    0.96275    0.60800   1.583    0.116    
## TrackGood    0.47109    0.63658   0.740    0.461    
## TrackHeavy  -1.18250    0.64855  -1.823    0.071 .  
## TrackMuddy  -0.36418    0.65151  -0.559    0.577    
## TrackSloppy  0.37248    0.67538   0.552    0.582    
## TrackSlow   -0.20402    0.66177  -0.308    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6002 on 108 degrees of freedom
## Multiple R-squared:  0.5815, Adjusted R-squared:  0.5544 
## F-statistic: 21.44 on 7 and 108 DF,  p-value: < 2.2e-16
anova(m0, m)
## Analysis of Variance Table
## 
## Model 1: Speed ~ 1
## Model 2: Speed ~ Starters + Track
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    115 92.971                                  
## 2    108 38.909  7    54.062 21.437 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now take a look at the ANOVA table

anova(m)
## Analysis of Variance Table
## 
## Response: Speed
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Starters    1 15.999 15.9992  44.409 1.162e-09 ***
## Track       6 38.063  6.3439  17.609 3.914e-14 ***
## Residuals 108 38.909  0.3603                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA with multiple explanatory variables

The ANOVA table when multiple explanatory variables are included differs from our one-way ANOVA table.

m <- lm(Speed ~ Starters + Track, data = ex0920)
anova(m)
## Analysis of Variance Table
## 
## Response: Speed
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Starters    1 15.999 15.9992  44.409 1.162e-09 ***
## Track       6 38.063  6.3439  17.609 3.914e-14 ***
## Residuals 108 38.909  0.3603                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To determine what full and reduced models are used in constructing this ANOVA table, we will fit a variety of different models.

m0  <- lm(Speed ~ 1,                data = ex0920)
mS  <- lm(Speed ~ Starters,         data = ex0920)
mT  <- lm(Speed ~            Track, data = ex0920)
mST <- lm(Speed ~ Starters + Track, data = ex0920)

You might think the ANOVA table compares

  1. intercept-only model to the model with Starters
  2. the Starters model to the Starters+Track model
anova(m0, mS ) # we are looking to add (S)tarters to the model
## Analysis of Variance Table
## 
## Model 1: Speed ~ 1
## Model 2: Speed ~ Starters
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    115 92.971                                  
## 2    114 76.972  1    15.999 23.696 3.659e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mS, mST) # we are looking to add (T)rack    to the model
## Analysis of Variance Table
## 
## Model 1: Speed ~ Starters
## Model 2: Speed ~ Starters + Track
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    114 76.972                                  
## 2    108 38.909  6    38.063 17.609 3.914e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We get identical results for the second comparison but not for the first.

The reason is what estimate of the error variance, \(\hat\sigma^2\), is used. If the models are all put together, then the error variance estimate is based on the fullest model in the list. So the following will recreate the full ANOVA table.

anova(m0, mS, mST)
## Analysis of Variance Table
## 
## Model 1: Speed ~ 1
## Model 2: Speed ~ Starters
## Model 3: Speed ~ Starters + Track
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    115 92.971                                  
## 2    114 76.972  1    15.999 44.409 1.162e-09 ***
## 3    108 38.909  6    38.063 17.609 3.914e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This method is more powerful than the comparing models one at a time and is therefore preferred.

Activity

  1. Recreate the following ANOVA table by fitting each model separately and using the anova() function to recreate the models.

Bonus:

  1. why is Df only 5 for the interaction?
m  <- lm(Speed ~ Track*Starters, data = Sleuth3::ex0920)
anova(m)
## Analysis of Variance Table
## 
## Response: Speed
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## Track            6 46.122  7.6869 21.0717 5.382e-16 ***
## Starters         1  7.941  7.9408 21.7675 9.284e-06 ***
## Track:Starters   5  1.335  0.2669  0.7317    0.6013    
## Residuals      103 37.574  0.3648                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Click for solution

We need to construct each model:

m0  <- lm(Speed ~ 1, data = Sleuth3::ex0920)
mT  <- lm(Speed ~ Track         , data = Sleuth3::ex0920)
mTS <- lm(Speed ~ Track+Starters, data = Sleuth3::ex0920)
m   <- lm(Speed ~ Track*Starters, data = Sleuth3::ex0920)

Then we use anova() to compare the models:

anova(m0, mT, mTS, m)
## Analysis of Variance Table
## 
## Model 1: Speed ~ 1
## Model 2: Speed ~ Track
## Model 3: Speed ~ Track + Starters
## Model 4: Speed ~ Track * Starters
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1    115 92.971                                   
## 2    109 46.850  6    46.122 21.0717 5.382e-16 ***
## 3    108 38.909  1     7.941 21.7675 9.284e-06 ***
## 4    103 37.574  5     1.335  0.7317    0.6013    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dropping one term

The previous ANOVA table is often referred to as a sequential, or Type I, sum of squares because the tests are sequential, e.g. 

  1. should we add Track?
  2. should we add Starters (in addition to Track)?
  3. should we add the interaction between Track and Starters (in addition to Track and Starters)?

Often times we are interested in whether terms should be dropped from the full model.

In this situation, we always have a consistent full model and we are only considering dropping a particular explanatory variable. For example, consider the Kentucky Derby data with main effects for Track and Starters and we want to consider dropping Track or dropping Starters.

m <- lm(Speed ~ Track + Starters, data = Sleuth3::ex0920)
drop1(m, test="F")
## Single term deletions
## 
## Model:
## Speed ~ Track + Starters
##          Df Sum of Sq    RSS      AIC F value    Pr(>F)    
## <none>                38.909 -110.714                      
## Track     6    38.063 76.972  -43.577  17.609 3.914e-14 ***
## Starters  1     7.941 46.850  -91.171  22.041 7.880e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the full model is the model with both Track and Starters. There is a different reduced model for each line:

  • Track line has the reduced model that has no Track explanatory variable.
  • Starters line has the reduced model that has no Starters explanatory variable.

So it is equivalent to the following two comparisons:

mT  <- lm(Speed ~ Track,            data = Sleuth3::ex0920)
mS  <- lm(Speed ~         Starters, data = Sleuth3::ex0920)
mST <- lm(Speed ~ Track + Starters, data = Sleuth3::ex0920)
anova(mS,mST) # Consider adding Track into the model that already has Starters
## Analysis of Variance Table
## 
## Model 1: Speed ~ Starters
## Model 2: Speed ~ Track + Starters
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    114 76.972                                  
## 2    108 38.909  6    38.063 17.609 3.914e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mT,mST) # Consider adding Starters into the model that already has Track
## Analysis of Variance Table
## 
## Model 1: Speed ~ Track
## Model 2: Speed ~ Track + Starters
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    109 46.850                                 
## 2    108 38.909  1    7.9408 22.041 7.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Activity - Recreating ANOVA table

Recreate the following drop1() results:

m <- lm(Ingestion ~ Weight + Organic, data = Sleuth3::ex0921)
drop1(m, test='F')
## Single term deletions
## 
## Model:
## Ingestion ~ Weight + Organic
##         Df Sum of Sq      RSS    AIC F value  Pr(>F)  
## <none>               67301107 292.52                  
## Weight   1  29525368 96826475 297.44  7.0193 0.01749 *
## Organic  1   5833254 73134361 292.10  1.3868 0.25617  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Click for solution
mW  <- lm(Ingestion ~ Weight          , data = Sleuth3::ex0921)
mO  <- lm(Ingestion ~          Organic, data = Sleuth3::ex0921)
mWO <- lm(Ingestion ~ Weight + Organic, data = Sleuth3::ex0921)
anova(mO,mWO)
## Analysis of Variance Table
## 
## Model 1: Ingestion ~ Organic
## Model 2: Ingestion ~ Weight + Organic
##   Res.Df      RSS Df Sum of Sq      F  Pr(>F)  
## 1     17 96826475                              
## 2     16 67301107  1  29525368 7.0193 0.01749 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mW,mWO)
## Analysis of Variance Table
## 
## Model 1: Ingestion ~ Weight
## Model 2: Ingestion ~ Weight + Organic
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     17 73134361                           
## 2     16 67301107  1   5833254 1.3868 0.2562

Including interactions

When interactions are considered, there is a rule of thumb that says that if an interaction is in the model, you should keep the main effects in the model. Thus, when you run drop1 on models with an interaction, it will only attempt to drop the interaction.

m <- lm(Ingestion ~ Weight * Organic, data = Sleuth3::ex0921)
drop1(m, test = 'F')
## Single term deletions
## 
## Model:
## Ingestion ~ Weight * Organic
##                Df Sum of Sq      RSS    AIC F value Pr(>F)
## <none>                      61739502 292.89               
## Weight:Organic  1   5561605 67301107 292.52  1.3512 0.2632

It is not obvious from this output that you are comparing the model that contains the main effects with the model that additional adds the interaction. You can check this with the anova function.

m0 <- lm(Ingestion ~ Weight + Organic, data = Sleuth3::ex0921)
anova(m0,m)
## Analysis of Variance Table
## 
## Model 1: Ingestion ~ Weight + Organic
## Model 2: Ingestion ~ Weight * Organic
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1     16 67301107                           
## 2     15 61739502  1   5561605 1.3512 0.2632

Notice that the statistics are all the same as the drop1 results.

Model construction strategy

When constructing a model, follow these prioritized rules

  1. Include explanatory variables that are part of the design.
  2. Include explanatory variables that answer questions of scientific interest.
  3. Include lower order terms, e.g. main effects when interactions are present and linear terms when quadratic terms are present.
  4. Include ``significant’’ explanatory variables.

In particular, do NOT remove insignificant design variables.