To follow along, use the lab12 code and make sure the following package is installed:
You can use the following code to perform the installation:
install.packages("lme4")
Now load the packages:
library("dplyr")
library("ggplot2")
# library("Sleuth3")
library("lme4")
Generalized linear models (GLMs) are a class of models that include linear regression, logistic regression, and Poisson regression as specific examples. The glm() function allows us to fit all of these models, although we usually use the lm() function for the special (and important) case of linear regression.
The linear regression model can be fit using either the glm() or lm() function, but running summary() on the lm() fit is more informative.
m_lm <- lm(Velocity ~ Distance, data = Sleuth3::case0701)
m_glm <- glm(Velocity ~ Distance, data = Sleuth3::case0701)
where the default family in the glm() function is gaussian (i.e., normal).
summary(m_lm)
##
## Call:
## lm(formula = Velocity ~ Distance, data = Sleuth3::case0701)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398.02 -157.03 -12.88 148.14 506.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.25 83.52 -0.482 0.635
## Distance 453.63 75.30 6.024 4.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233.2 on 22 degrees of freedom
## Multiple R-squared: 0.6226, Adjusted R-squared: 0.6054
## F-statistic: 36.29 on 1 and 22 DF, p-value: 4.608e-06
summary(m_glm)
##
## Call:
## glm(formula = Velocity ~ Distance, data = Sleuth3::case0701)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -398.02 -157.03 -12.88 148.14 506.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.25 83.52 -0.482 0.635
## Distance 453.63 75.30 6.024 4.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 54385.57)
##
## Null deviance: 3170091 on 23 degrees of freedom
## Residual deviance: 1196482 on 22 degrees of freedom
## AIC: 333.71
##
## Number of Fisher Scoring iterations: 2
For the gaussian family, the dispersion parameter is the estimate of the residual variance, i.e., the square of the residual standard error reported by lm().
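To see this concretely, the dispersion can be recomputed by hand as the residual sum of squares divided by the residual degrees of freedom. The sketch below uses the built-in `cars` data set so that it is self-contained; any data set would work the same way.

```r
# Fit the same simple linear regression with lm() and glm()
# (gaussian is the default family for glm())
m_lm  <- lm(dist ~ speed, data = cars)
m_glm <- glm(dist ~ speed, data = cars)

# The dispersion reported by summary.glm() is RSS / residual df ...
sum(residuals(m_glm)^2) / df.residual(m_glm)

# ... which is the same as the squared residual standard error from lm()
sigma(m_lm)^2
```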
Logistic regression can be accomplished using the glm() function with the family argument set to binomial. There are two different ways to set up the logistic regression model fit, depending on whether or not the data are grouped.
For ungrouped data, the response can be a factor where the first level is failure and all other levels are success.
levels(Sleuth3::case2002$LC)
## [1] "LungCancer" "NoCancer"
m <- glm(LC ~ CD, data = Sleuth3::case2002, family=binomial)
m
##
## Call: glm(formula = LC ~ CD, family = binomial, data = Sleuth3::case2002)
##
## Coefficients:
## (Intercept) CD
## 1.53541 -0.05113
##
## Degrees of Freedom: 146 Total (i.e. Null); 145 Residual
## Null Deviance: 187.1
## Residual Deviance: 179.6 AIC: 183.6
Thus, this logistic regression analysis has "success" being "NoCancer".
It is better practice to be explicit. You can be explicit by using an equality comparison, which treats the response as TRUE/FALSE.
m <- glm(LC == "LungCancer" ~ CD,
data = Sleuth3::case2002,
family = binomial)
m
##
## Call: glm(formula = LC == "LungCancer" ~ CD, family = binomial, data = Sleuth3::case2002)
##
## Coefficients:
## (Intercept) CD
## -1.53541 0.05113
##
## Degrees of Freedom: 146 Total (i.e. Null); 145 Residual
## Null Deviance: 187.1
## Residual Deviance: 179.6 AIC: 183.6
Care needs to be taken with this approach because any value other than "LungCancer" will be treated as a failure, e.g., typos.
Finally, we can use a 0-1 coding where 1 is a success and 0 is failure.
d <- Sleuth3::case2002 %>%
mutate(lung_cancer = 1*(LC == "LungCancer"))
m <- glm(lung_cancer ~ CD,
data = d,
family = binomial)
m
##
## Call: glm(formula = lung_cancer ~ CD, family = binomial, data = d)
##
## Coefficients:
## (Intercept) CD
## -1.53541 0.05113
##
## Degrees of Freedom: 146 Total (i.e. Null); 145 Residual
## Null Deviance: 187.1
## Residual Deviance: 179.6 AIC: 183.6
To obtain a better summary, use summary().
summary(m)
##
## Call:
## glm(formula = lung_cancer ~ CD, family = binomial, data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5148 -0.9688 -0.7166 1.3449 1.8603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.53541 0.37707 -4.072 4.66e-05 ***
## CD 0.05113 0.01939 2.637 0.00836 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 187.14 on 146 degrees of freedom
## Residual deviance: 179.62 on 145 degrees of freedom
## AIC: 183.62
##
## Number of Fisher Scoring iterations: 4
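To interpret these estimates, exponentiating the CD coefficient gives an odds ratio, and plogis() converts the linear predictor to a probability. Here is a sketch using the (rounded) estimates printed above, so the numbers are only as precise as that output:

```r
# Estimated coefficients from the summary above (rounded)
b0 <- -1.53541  # intercept
b1 <-  0.05113  # CD (cigarettes per day)

# Each additional cigarette per day multiplies the odds of lung cancer by
exp(b1)                # about 1.05, i.e., roughly a 5% increase in odds

# Estimated probability of lung cancer for someone smoking 20 cigarettes/day
plogis(b0 + b1 * 20)   # about 0.37
```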
Fit a logistic regression model to determine the relationship between survival in the Donner Party (case2001
) and the age and sex of the individual.
m <- glm(Status == "Survived" ~ Age + Sex,
data = Sleuth3::case2001,
family = binomial)
summary(m)
##
## Call:
## glm(formula = Status == "Survived" ~ Age + Sex, family = binomial,
## data = Sleuth3::case2001)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7445 -1.0441 -0.3029 0.8877 2.0472
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.23041 1.38686 2.329 0.0198 *
## Age -0.07820 0.03728 -2.097 0.0359 *
## SexMale -1.59729 0.75547 -2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.827 on 44 degrees of freedom
## Residual deviance: 51.256 on 42 degrees of freedom
## AIC: 57.256
##
## Number of Fisher Scoring iterations: 4
Often, data are grouped by the values of the explanatory variables. For example, we can group the lung cancer data set by the number of cigarettes per day. The most typical way these data are stored is to have the number of observations and the number of them that were "successful".
lung_grouped <- Sleuth3::case2002 %>%
group_by(CD) %>%
summarize(n = n(),
y = sum(LC == "LungCancer"))
lung_grouped
## # A tibble: 19 x 3
## CD n y
## * <int> <int> <int>
## 1 0 17 1
## 2 1 2 0
## 3 2 2 0
## 4 3 1 1
## 5 4 2 0
## 6 5 2 0
## 7 6 3 1
## 8 8 2 0
## 9 10 15 4
## 10 12 5 1
## 11 15 27 12
## 12 16 1 1
## 13 18 2 0
## 14 20 38 16
## 15 25 15 7
## 16 30 7 3
## 17 37 1 0
## 18 40 4 2
## 19 45 1 0
You can still fit the regression model using the glm() function, but you need to use the cbind() function as the response, where the first column is the number of successes and the second column is the number of failures. The logistic regression model fit then looks like the following. (Be careful not to use the c() function, or you will get a "variable lengths differ" error.)
m <- glm(cbind(y,n-y) ~ CD,
data = lung_grouped,
family = binomial)
summary(m)
##
## Call:
## glm(formula = cbind(y, n - y) ~ CD, family = binomial, data = lung_grouped)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5148 -1.0253 -0.5070 0.3305 1.7922
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.53541 0.37707 -4.072 4.66e-05 ***
## CD 0.05113 0.01939 2.637 0.00836 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.651 on 18 degrees of freedom
## Residual deviance: 21.141 on 17 degrees of freedom
## AIC: 48.879
##
## Number of Fisher Scoring iterations: 4
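Notice that the coefficient estimates match the ungrouped fit exactly; only the deviances and AIC differ, because the likelihood is defined on a different data layout. Here is a self-contained check of that equivalence using simulated data (base R only, so it runs without any extra packages):

```r
set.seed(42)

# Simulated ungrouped binary data: one covariate with a few distinct values
ungrouped <- data.frame(x = sample(0:4, 200, replace = TRUE))
ungrouped$y <- rbinom(200, size = 1, prob = plogis(-1 + 0.5 * ungrouped$x))

# Group by covariate value: successes (s) and trials (n) per group
grouped <- data.frame(
  x = sort(unique(ungrouped$x)),
  s = as.vector(tapply(ungrouped$y, ungrouped$x, sum)),
  n = as.vector(tapply(ungrouped$y, ungrouped$x, length))
)

# Fit both ways
m_ungrouped <- glm(y ~ x,               data = ungrouped, family = binomial)
m_grouped   <- glm(cbind(s, n - s) ~ x, data = grouped,   family = binomial)

# The coefficient estimates agree to numerical precision
coef(m_ungrouped)
coef(m_grouped)
```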
Group the Donner party data (Sleuth3::case2001
) by age and sex. Then re-run the logistic regression analysis from the previous activity on these grouped data.
donner_grouped <- Sleuth3::case2001 %>%
group_by(Age,Sex) %>%
summarize(y = sum(Status == "Survived"),
n = n())
## `summarise()` has grouped output by 'Age'. You can override using the `.groups` argument.
m <- glm(cbind(y,n-y) ~ Age + Sex,
data = donner_grouped,
family = binomial)
summary(m)
##
## Call:
## glm(formula = cbind(y, n - y) ~ Age + Sex, family = binomial,
## data = donner_grouped)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4955 -0.7860 0.1202 0.7618 2.0472
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.23041 1.38697 2.329 0.0199 *
## Age -0.07820 0.03729 -2.097 0.0360 *
## SexMale -1.59729 0.75550 -2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37.012 on 27 degrees of freedom
## Residual deviance: 26.441 on 25 degrees of freedom
## AIC: 42.347
##
## Number of Fisher Scoring iterations: 5
To run a Poisson regression in R, use the glm()
function with the family
argument set to poisson
.
m <- glm(Matings ~ Age,
data = Sleuth3::case2201,
family = poisson)
summary(m)
##
## Call:
## glm(formula = Matings ~ Age, family = poisson, data = Sleuth3::case2201)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## Age 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
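Because Poisson regression uses a log link, exponentiating a coefficient gives a multiplicative effect on the mean count. Using the (rounded) Age estimate printed above:

```r
b_age <- 0.06869  # Age coefficient from the summary above (rounded)

# Each additional year of age multiplies the expected number of matings by
exp(b_age)        # about 1.07, i.e., roughly a 7% increase per year
```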
Investigate the relationship between the number of salamanders and forest age using a Poisson regression analysis. Data are available in `Sleuth3::case2202`.
m <- glm(Salamanders ~ ForestAge,
data = Sleuth3::case2202,
family = poisson)
summary(m)
##
## Call:
## glm(formula = Salamanders ~ ForestAge, family = poisson, data = Sleuth3::case2202)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6970 -1.8539 -0.7987 0.2144 4.4582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5040207 0.1401385 3.597 0.000322 ***
## ForestAge 0.0019151 0.0004155 4.609 4.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 190.22 on 46 degrees of freedom
## Residual deviance: 170.65 on 45 degrees of freedom
## AIC: 259.7
##
## Number of Fisher Scoring iterations: 6
To compare nested linear models, we used the F-test through the R function anova(). For GLMs, we will use a likelihood ratio test through the same function. Asymptotically (i.e., as the sample size goes to infinity), the statistic for this test has a chi-squared distribution. Thus, we specify this test by setting the test argument of anova() to "Chi".
Here is a test to determine whether we should have included an interaction between birdkeeping and the number of cigarettes per day.
mA <- glm(LC ~ BK + CD, data = Sleuth3::case2002, family = binomial)
mI <- glm(LC ~ BK * CD, data = Sleuth3::case2002, family = binomial)
anova(mA,mI, test="Chi")
## Analysis of Deviance Table
##
## Model 1: LC ~ BK + CD
## Model 2: LC ~ BK * CD
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 144 164.36
## 2 143 164.35 1 0.0068661 0.934
Note that the title of this table is “Analysis of Deviance” which indicates you are working with GLMs rather than linear models.
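The likelihood ratio test statistic is the drop in deviance between the two models, referred to a chi-squared distribution whose degrees of freedom equal the number of extra parameters. The p-value in the table above can be reproduced by hand from the printed deviance difference:

```r
# Drop in deviance and extra parameters, read from the table above
dev_diff <- 0.0068661
df_diff  <- 1

# Upper-tail chi-squared probability reproduces the reported p-value
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)  # about 0.934
```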
Use a likelihood ratio test to determine whether an interaction between Age and Sex should be included in the logistic regression model of survival probability in the Donner party.
mA <- glm(Status ~ Age + Sex, data = Sleuth3::case2001, family = binomial)
mI <- glm(Status ~ Age * Sex, data = Sleuth3::case2001, family = binomial)
anova(mA,mI, test="Chi")
## Analysis of Deviance Table
##
## Model 1: Status ~ Age + Sex
## Model 2: Status ~ Age * Sex
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 42 51.256
## 2 41 47.346 1 3.9099 0.048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To fit a mixed effect model in R, use the lme4 package. For a linear regression model with random effects, use the lmer() function. Here is a random intercept model.
library("lme4")
m <- lmer(Reaction ~ Days + (1| Subject), data = sleepstudy)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1786.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2257 -0.5529 0.0109 0.5188 4.2506
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 1378.2 37.12
## Residual 960.5 30.99
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.4051 9.7467 25.79
## Days 10.4673 0.8042 13.02
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.371
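The subject-level intercepts can be extracted with ranef(), and the population-level estimates with fixef(); coef() combines the two. This sketch assumes lme4 is installed, as above:

```r
library("lme4")
m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Population-level (fixed) estimates: one intercept, one slope
fixef(m)

# Subject-specific deviations from the population intercept
head(ranef(m)$Subject)

# coef() adds the deviations to the fixed effects, one row per subject
head(coef(m)$Subject)
```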
Here is a random slope model (which also includes a random intercept).
m <- lmer(Reaction ~ Days + (Days| Subject), data = sleepstudy)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
## Data: sleepstudy
##
## REML criterion at convergence: 1743.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9536 -0.4634 0.0231 0.4634 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 612.10 24.741
## Days 35.07 5.922 0.07
## Residual 654.94 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.825 36.838
## Days 10.467 1.546 6.771
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
For any generalized linear model, e.g., logistic and Poisson regression, you can add random effects using the glmer() function. Here is a random intercept model.
m <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)
summary(m)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [
## glmerMod]
## Family: binomial ( logit )
## Formula: cbind(incidence, size - incidence) ~ period + (1 | herd)
## Data: cbpp
##
## AIC BIC logLik deviance df.resid
## 194.1 204.2 -92.0 184.1 51
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3816 -0.7889 -0.2026 0.5142 2.8791
##
## Random effects:
## Groups Name Variance Std.Dev.
## herd (Intercept) 0.4123 0.6421
## Number of obs: 56, groups: herd, 15
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3983 0.2312 -6.048 1.47e-09 ***
## period2 -0.9919 0.3032 -3.272 0.001068 **
## period3 -1.1282 0.3228 -3.495 0.000474 ***
## period4 -1.5797 0.4220 -3.743 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) perid2 perid3
## period2 -0.363
## period3 -0.340 0.280
## period4 -0.260 0.213 0.198