To follow along, use the lab10 code.
Now load the packages
library("tidyverse")
library("Sleuth3")
To run an ANOVA in R, use the anova()
function on an lm
object.
m <- lm(Lifetime ~ Diet, data = case0501)
anova(m)
## Analysis of Variance Table
##
## Response: Lifetime
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 5 12734 2546.8 57.104 < 2.2e-16 ***
## Residuals 343 15297 44.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note a few things about this table:
This F-test also shows up in the R regression output:
summary(m)
##
## Call:
## lm(formula = Lifetime ~ Diet, data = case0501)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5167 -3.3857 0.8143 5.1833 10.0143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.6912 0.8846 36.958 < 2e-16 ***
## DietN/R40 12.4254 1.2352 10.059 < 2e-16 ***
## DietN/R50 9.6060 1.1877 8.088 1.06e-14 ***
## DietNP -5.2892 1.3010 -4.065 5.95e-05 ***
## DietR/R50 10.1945 1.2565 8.113 8.88e-15 ***
## Dietlopro 6.9945 1.2565 5.567 5.25e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared: 0.4543, Adjusted R-squared: 0.4463
## F-statistic: 57.1 on 5 and 343 DF, p-value: < 2.2e-16
where the last line provides the F-statistic, the numerator and denominator degrees of freedom, and the associated p-value. The two models being compared here are the intercept-only model and the model with all explanatory variables (in this case the indicator functions for all but one Diet).
This F-test can be constructed manually by fitting these two models and using the anova()
function to compare the models.
m0 <- lm(Lifetime ~ 1, data = case0501) # 1 indicates intercept
anova(m0, m)
## Analysis of Variance Table
##
## Model 1: Lifetime ~ 1
## Model 2: Lifetime ~ Diet
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 348 28031
## 2 343 15297 5 12734 57.104 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This table contains the information from the intercept-only model in row 1 and the information from the model with Diet, as a categorical variable, in row 2.
The ex0920
data set contains information on the Kentucky Derby winners from 1896-2011.
summary()
output of this multiple regression model compares the intercept-only model with the model with all of the explanatory variables.Bonus
m0 <- lm(Speed ~ 1, data = ex0920)
m <- lm(Speed ~ Starters + Track, data = ex0920)
summary(m)
##
## Call:
## lm(formula = Speed ~ Starters + Track, data = ex0920)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.93774 -0.27403 0.00487 0.39355 1.22224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.78998 0.60750 57.268 < 2e-16 ***
## Starters 0.05500 0.01172 4.695 7.88e-06 ***
## TrackFast 0.96275 0.60800 1.583 0.116
## TrackGood 0.47109 0.63658 0.740 0.461
## TrackHeavy -1.18250 0.64855 -1.823 0.071 .
## TrackMuddy -0.36418 0.65151 -0.559 0.577
## TrackSloppy 0.37248 0.67538 0.552 0.582
## TrackSlow -0.20402 0.66177 -0.308 0.758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6002 on 108 degrees of freedom
## Multiple R-squared: 0.5815, Adjusted R-squared: 0.5544
## F-statistic: 21.44 on 7 and 108 DF, p-value: < 2.2e-16
anova(m0, m)
## Analysis of Variance Table
##
## Model 1: Speed ~ 1
## Model 2: Speed ~ Starters + Track
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 115 92.971
## 2 108 38.909 7 54.062 21.437 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now take a look at the ANOVA table
anova(m)
## Analysis of Variance Table
##
## Response: Speed
## Df Sum Sq Mean Sq F value Pr(>F)
## Starters 1 15.999 15.9992 44.409 1.162e-09 ***
## Track 6 38.063 6.3439 17.609 3.914e-14 ***
## Residuals 108 38.909 0.3603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table when multiple explanatory variables are included differs from our one-way ANOVA table.
m <- lm(Speed ~ Starters + Track, data = ex0920)
anova(m)
## Analysis of Variance Table
##
## Response: Speed
## Df Sum Sq Mean Sq F value Pr(>F)
## Starters 1 15.999 15.9992 44.409 1.162e-09 ***
## Track 6 38.063 6.3439 17.609 3.914e-14 ***
## Residuals 108 38.909 0.3603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To determine what full and reduced models are used in constructing this ANOVA table, we will fit a variety of different models.
m0 <- lm(Speed ~ 1, data = ex0920)
mS <- lm(Speed ~ Starters, data = ex0920)
mT <- lm(Speed ~ Track, data = ex0920)
mST <- lm(Speed ~ Starters + Track, data = ex0920)
You might think the ANOVA table compares
anova(m0, mS ) # we are looking to add (S)tarters to the model
## Analysis of Variance Table
##
## Model 1: Speed ~ 1
## Model 2: Speed ~ Starters
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 115 92.971
## 2 114 76.972 1 15.999 23.696 3.659e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mS, mST) # we are looking to add (T)rack to the model
## Analysis of Variance Table
##
## Model 1: Speed ~ Starters
## Model 2: Speed ~ Starters + Track
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 114 76.972
## 2 108 38.909 6 38.063 17.609 3.914e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We get identical results for the second comparison but not for the first.
The reason is what estimate of the error variance, \(\hat\sigma^2\), is used. If the models are all put together, then the error variance estimate is based on the fullest model in the list. So the following will recreate the full ANOVA table.
anova(m0, mS, mST)
## Analysis of Variance Table
##
## Model 1: Speed ~ 1
## Model 2: Speed ~ Starters
## Model 3: Speed ~ Starters + Track
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 115 92.971
## 2 114 76.972 1 15.999 44.409 1.162e-09 ***
## 3 108 38.909 6 38.063 17.609 3.914e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This method is more powerful than the comparing models one at a time and is therefore preferred.
anova()
function to recreate the models.Bonus:
m <- lm(Speed ~ Track*Starters, data = Sleuth3::ex0920)
anova(m)
## Analysis of Variance Table
##
## Response: Speed
## Df Sum Sq Mean Sq F value Pr(>F)
## Track 6 46.122 7.6869 21.0717 5.382e-16 ***
## Starters 1 7.941 7.9408 21.7675 9.284e-06 ***
## Track:Starters 5 1.335 0.2669 0.7317 0.6013
## Residuals 103 37.574 0.3648
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We need to construct each model:
m0 <- lm(Speed ~ 1, data = Sleuth3::ex0920)
mT <- lm(Speed ~ Track , data = Sleuth3::ex0920)
mTS <- lm(Speed ~ Track+Starters, data = Sleuth3::ex0920)
m <- lm(Speed ~ Track*Starters, data = Sleuth3::ex0920)
Then we use anova()
to compare the models:
anova(m0, mT, mTS, m)
## Analysis of Variance Table
##
## Model 1: Speed ~ 1
## Model 2: Speed ~ Track
## Model 3: Speed ~ Track + Starters
## Model 4: Speed ~ Track * Starters
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 115 92.971
## 2 109 46.850 6 46.122 21.0717 5.382e-16 ***
## 3 108 38.909 1 7.941 21.7675 9.284e-06 ***
## 4 103 37.574 5 1.335 0.7317 0.6013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The previous ANOVA table is often referred to as a sequential, or Type I, sum of squares because the tests are sequential, e.g.
Often times we are interested in whether terms should be dropped from the full model.
In this situation, we always have a consistent full model and we are only considering dropping a particular explanatory variable. For example, consider the Kentucky Derby data with main effects for Track and Starters and we want to consider dropping Track or dropping Starters.
m <- lm(Speed ~ Track + Starters, data = Sleuth3::ex0920)
drop1(m, test="F")
## Single term deletions
##
## Model:
## Speed ~ Track + Starters
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 38.909 -110.714
## Track 6 38.063 76.972 -43.577 17.609 3.914e-14 ***
## Starters 1 7.941 46.850 -91.171 22.041 7.880e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the full model is the model with both Track and Starters. There is a different reduced model for each line:
So it is equivalent to the following two comparisons:
mT <- lm(Speed ~ Track, data = Sleuth3::ex0920)
mS <- lm(Speed ~ Starters, data = Sleuth3::ex0920)
mST <- lm(Speed ~ Track + Starters, data = Sleuth3::ex0920)
anova(mS,mST) # Consider adding Track into the model that already has Starters
## Analysis of Variance Table
##
## Model 1: Speed ~ Starters
## Model 2: Speed ~ Track + Starters
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 114 76.972
## 2 108 38.909 6 38.063 17.609 3.914e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mT,mST) # Consider adding Starters into the model that already has Track
## Analysis of Variance Table
##
## Model 1: Speed ~ Track
## Model 2: Speed ~ Track + Starters
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109 46.850
## 2 108 38.909 1 7.9408 22.041 7.88e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recreate the following drop1()
results:
m <- lm(Ingestion ~ Weight + Organic, data = Sleuth3::ex0921)
drop1(m, test='F')
## Single term deletions
##
## Model:
## Ingestion ~ Weight + Organic
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 67301107 292.52
## Weight 1 29525368 96826475 297.44 7.0193 0.01749 *
## Organic 1 5833254 73134361 292.10 1.3868 0.25617
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mW <- lm(Ingestion ~ Weight , data = Sleuth3::ex0921)
mO <- lm(Ingestion ~ Organic, data = Sleuth3::ex0921)
mWO <- lm(Ingestion ~ Weight + Organic, data = Sleuth3::ex0921)
anova(mO,mWO)
## Analysis of Variance Table
##
## Model 1: Ingestion ~ Organic
## Model 2: Ingestion ~ Weight + Organic
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 96826475
## 2 16 67301107 1 29525368 7.0193 0.01749 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mW,mWO)
## Analysis of Variance Table
##
## Model 1: Ingestion ~ Weight
## Model 2: Ingestion ~ Weight + Organic
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 73134361
## 2 16 67301107 1 5833254 1.3868 0.2562
When interactions are considered, there is a rule of thumb that says that if an interaction is in the model, you should keep the main effects in the model. Thus, when you run drop1 on models with an interaction, it will only attempt to drop the interaction.
m <- lm(Ingestion ~ Weight * Organic, data = Sleuth3::ex0921)
drop1(m, test = 'F')
## Single term deletions
##
## Model:
## Ingestion ~ Weight * Organic
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 61739502 292.89
## Weight:Organic 1 5561605 67301107 292.52 1.3512 0.2632
It is not obvious from this output that you are comparing the model that contains the main effects with the model that additional adds the interaction. You can check this with the anova
function.
m0 <- lm(Ingestion ~ Weight + Organic, data = Sleuth3::ex0921)
anova(m0,m)
## Analysis of Variance Table
##
## Model 1: Ingestion ~ Weight + Organic
## Model 2: Ingestion ~ Weight * Organic
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 16 67301107
## 2 15 61739502 1 5561605 1.3512 0.2632
Notice that the statistics are all the same as the drop1
results.
When constructing a model, follow these prioritized rules
In particular, do NOT remove insignificant design variables.