Preparation

To follow along, use the lab12 code and make sure the following packag is installed:

You can use the following code to perform the installation:

install.packages("lme4")

Now load the packages

library("dplyr")
library("ggplot2")
# library("Sleuth3")
library("lme4")

Generalized linear models

Generalized linear models are a class of models that include

as specific examples. The glm() function allows us to fit all of these models, although we usually use the lm() function for the special (and important) case of linear regression.

Linear regression

The linear regression model can be fit using either the glm() or lm() function, but running summary() on the lm() function analysis is more informative.

m_lm  <- lm( Velocity ~ Distance, data = Sleuth3::case0701)
m_glm <- glm(Velocity ~ Distance, data = Sleuth3::case0701)

where the default family in the glm() function is normal.

summary(m_lm)
## 
## Call:
## lm(formula = Velocity ~ Distance, data = Sleuth3::case0701)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398.02 -157.03  -12.88  148.14  506.61 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -40.25      83.52  -0.482    0.635    
## Distance      453.63      75.30   6.024 4.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 233.2 on 22 degrees of freedom
## Multiple R-squared:  0.6226, Adjusted R-squared:  0.6054 
## F-statistic: 36.29 on 1 and 22 DF,  p-value: 4.608e-06
summary(m_glm)
## 
## Call:
## glm(formula = Velocity ~ Distance, data = Sleuth3::case0701)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -398.02  -157.03   -12.88   148.14   506.61  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -40.25      83.52  -0.482    0.635    
## Distance      453.63      75.30   6.024 4.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 54385.57)
## 
##     Null deviance: 3170091  on 23  degrees of freedom
## Residual deviance: 1196482  on 22  degrees of freedom
## AIC: 333.71
## 
## Number of Fisher Scoring iterations: 2

The dispersion parameter is the estimate of the variance.

Logistic regression

Logistic regression can be accomplished using the glm() function and setting the family argument to binomial. There are two different ways to set up the logistic regression model fit depending on whether the data are grouped or not.

Ungrouped data

For ungrouped data, the response can be a factor where the first level is failure and all other levels are success.

levels(Sleuth3::case2002$LC)
## [1] "LungCancer" "NoCancer"
m <- glm(LC ~ CD, data = Sleuth3::case2002, family=binomial)
m
## 
## Call:  glm(formula = LC ~ CD, family = binomial, data = Sleuth3::case2002)
## 
## Coefficients:
## (Intercept)           CD  
##     1.53541     -0.05113  
## 
## Degrees of Freedom: 146 Total (i.e. Null);  145 Residual
## Null Deviance:       187.1 
## Residual Deviance: 179.6     AIC: 183.6

Thus, this logistic regression analysis has success'' beingNoCancer’’.

It is better practice to be explicit. You can be explicit by using an equality and thus treating the response as TRUE/FALSE.

m <- glm(LC == "LungCancer" ~ CD, 
         data = Sleuth3::case2002, 
         family = binomial)
m
## 
## Call:  glm(formula = LC == "LungCancer" ~ CD, family = binomial, data = Sleuth3::case2002)
## 
## Coefficients:
## (Intercept)           CD  
##    -1.53541      0.05113  
## 
## Degrees of Freedom: 146 Total (i.e. Null);  145 Residual
## Null Deviance:       187.1 
## Residual Deviance: 179.6     AIC: 183.6

Care needs to be taken with this approach because any value other than ``LungCancer’’ will be treated as a failure, e.g. typos.

Finally, we can use a 0-1 coding where 1 is a success and 0 is failure.

d <- Sleuth3::case2002 %>%
  mutate(lung_cancer = 1*(LC == "LungCancer"))
m <- glm(LC == "LungCancer" ~ CD, 
         data = d, 
         family = binomial)
m
## 
## Call:  glm(formula = LC == "LungCancer" ~ CD, family = binomial, data = d)
## 
## Coefficients:
## (Intercept)           CD  
##    -1.53541      0.05113  
## 
## Degrees of Freedom: 146 Total (i.e. Null);  145 Residual
## Null Deviance:       187.1 
## Residual Deviance: 179.6     AIC: 183.6

To obtain a better summary use summary().

summary(m)
## 
## Call:
## glm(formula = LC == "LungCancer" ~ CD, family = binomial, data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5148  -0.9688  -0.7166   1.3449   1.8603  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.53541    0.37707  -4.072 4.66e-05 ***
## CD           0.05113    0.01939   2.637  0.00836 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 187.14  on 146  degrees of freedom
## Residual deviance: 179.62  on 145  degrees of freedom
## AIC: 183.62
## 
## Number of Fisher Scoring iterations: 4

Activity

Fit a logistic regression model to determine the relationship between survival in the Donner Party (case2001) and the age and sex of the individual.

Click for solution
m <- glm(Status == "Survived" ~ Age + Sex,
         data = Sleuth3::case2001,
         family = binomial)
summary(m)
## 
## Call:
## glm(formula = Status == "Survived" ~ Age + Sex, family = binomial, 
##     data = Sleuth3::case2001)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7445  -1.0441  -0.3029   0.8877   2.0472  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  3.23041    1.38686   2.329   0.0198 *
## Age         -0.07820    0.03728  -2.097   0.0359 *
## SexMale     -1.59729    0.75547  -2.114   0.0345 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.827  on 44  degrees of freedom
## Residual deviance: 51.256  on 42  degrees of freedom
## AIC: 57.256
## 
## Number of Fisher Scoring iterations: 4

Grouped data

Often times data are grouped by explanatory variable values. For example, we can group the lung cancer data set by the number of cigarettes per day. The most typical way these data are stored is to have the number of observations and the proportion of them that were ``successful’’.

lung_grouped <- Sleuth3::case2002 %>%
  group_by(CD) %>%
  summarize(n = n(),
            y = sum(LC == "LungCancer"))

lung_grouped
## # A tibble: 19 x 3
##       CD     n     y
##  * <int> <int> <int>
##  1     0    17     1
##  2     1     2     0
##  3     2     2     0
##  4     3     1     1
##  5     4     2     0
##  6     5     2     0
##  7     6     3     1
##  8     8     2     0
##  9    10    15     4
## 10    12     5     1
## 11    15    27    12
## 12    16     1     1
## 13    18     2     0
## 14    20    38    16
## 15    25    15     7
## 16    30     7     3
## 17    37     1     0
## 18    40     4     2
## 19    45     1     0

You can still fit the regression model using the glm() function, but you need to using the cbind() function as the response where you have the first term is the number of successes and the second term is the number of failures. So the logistic regression model fit looks like. (Be careful not to use the c() function or you will get a ``variable lengths differ’’ error.)

m <- glm(cbind(y,n-y) ~ CD, 
         data = lung_grouped, 
         family = binomial)
summary(m)
## 
## Call:
## glm(formula = cbind(y, n - y) ~ CD, family = binomial, data = lung_grouped)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5148  -1.0253  -0.5070   0.3305   1.7922  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.53541    0.37707  -4.072 4.66e-05 ***
## CD           0.05113    0.01939   2.637  0.00836 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.651  on 18  degrees of freedom
## Residual deviance: 21.141  on 17  degrees of freedom
## AIC: 48.879
## 
## Number of Fisher Scoring iterations: 4

Activity

Group the Donner party data (Sleuth3::case2001) by age and sex. Then re-run the logistic regression analysis from the previous activity on these grouped data.

Click for solution
donner_grouped <- Sleuth3::case2001 %>%
  group_by(Age,Sex) %>%
  summarize(y = sum(Status == "Survived"),
         n = n())
## `summarise()` has grouped output by 'Age'. You can override using the `.groups` argument.
m <- glm(cbind(y,n-y) ~ Age + Sex,
         data = donner_grouped,
         family = binomial)

summary(m)
## 
## Call:
## glm(formula = cbind(y, n - y) ~ Age + Sex, family = binomial, 
##     data = donner_grouped)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4955  -0.7860   0.1202   0.7618   2.0472  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  3.23041    1.38697   2.329   0.0199 *
## Age         -0.07820    0.03729  -2.097   0.0360 *
## SexMale     -1.59729    0.75550  -2.114   0.0345 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37.012  on 27  degrees of freedom
## Residual deviance: 26.441  on 25  degrees of freedom
## AIC: 42.347
## 
## Number of Fisher Scoring iterations: 5

Poisson regression

To run a Poisson regression in R, use the glm() function with the family argument set to poisson.

m <- glm(Matings ~ Age,
         data = Sleuth3::case2201,
         family = poisson)
summary(m)
## 
## Call:
## glm(formula = Matings ~ Age, family = poisson, data = Sleuth3::case2201)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## Age          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

Activity

Investigate the relationship between number of salamanders as a function of forest age using a Poisson regression analysis. Data are available in `Sleuth3::case2202’.

Click for solution
m <- glm(Salamanders ~ ForestAge, 
         data = Sleuth3::case2202,
         family = poisson)

summary(m)
## 
## Call:
## glm(formula = Salamanders ~ ForestAge, family = poisson, data = Sleuth3::case2202)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6970  -1.8539  -0.7987   0.2144   4.4582  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.5040207  0.1401385   3.597 0.000322 ***
## ForestAge   0.0019151  0.0004155   4.609 4.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 190.22  on 46  degrees of freedom
## Residual deviance: 170.65  on 45  degrees of freedom
## AIC: 259.7
## 
## Number of Fisher Scoring iterations: 6

Comparing generalized linear models

To compare nested linear models, we used the F-test through the R function anova(). For GLMs, we will use a likelihood ratio test through the same function. Asymptotically (i.e. as you have infinite data), the statistic for this test has a chi-squared distribution. Thus, we specify this test using the anova() table setting the test argument to “Chi”.

Here is a test to determine whether we should have included an interaction between birdkeeping and the number of cigarettes per day.

mA <- glm(LC ~ BK + CD, data = Sleuth3::case2002, family = binomial)
mI <- glm(LC ~ BK * CD, data = Sleuth3::case2002, family = binomial)
anova(mA,mI, test="Chi")
## Analysis of Deviance Table
## 
## Model 1: LC ~ BK + CD
## Model 2: LC ~ BK * CD
##   Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
## 1       144     164.36                      
## 2       143     164.35  1 0.0068661    0.934

Note that the title of this table is “Analysis of Deviance” which indicates you are working with GLMs rather than linear models.

Activity

Use a likelihood ratio test to determine whether an interaction between Age and Sex should be included in the logistic regression model of survival probability in the Donner party.

Click for solution
mA <- glm(Status ~ Age + Sex, data = Sleuth3::case2001, family = binomial)
mI <- glm(Status ~ Age * Sex, data = Sleuth3::case2001, family = binomial)
anova(mA,mI, test="Chi")
## Analysis of Deviance Table
## 
## Model 1: Status ~ Age + Sex
## Model 2: Status ~ Age * Sex
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        42     51.256                       
## 2        41     47.346  1   3.9099    0.048 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Generalized linear mixed effect models

To fit a mixed effect model in R, use the lme4 package.

Linear mixed effect models

For a linear regression model with random effects, use the lmer() function.

Here is a random intercept model.

library("lme4")
m <- lmer(Reaction ~ Days + (1| Subject), data = sleepstudy)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 251.4051     9.7467   25.79
## Days         10.4673     0.8042   13.02
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371

Here is a random slope model

m <- lmer(Reaction ~ Days + (Days| Subject), data = sleepstudy)
summary(m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.10   24.741       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825  36.838
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

Generalized linear mixed effect models

For any generalized linear model, e.g. logistic and Poisson regression, you can add random effects using the glmer() function.

Random intercept model.

m <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
              data = cbpp, family = binomial)
summary(m)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [
## glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(incidence, size - incidence) ~ period + (1 | herd)
##    Data: cbpp
## 
##      AIC      BIC   logLik deviance df.resid 
##    194.1    204.2    -92.0    184.1       51 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3816 -0.7889 -0.2026  0.5142  2.8791 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  herd   (Intercept) 0.4123   0.6421  
## Number of obs: 56, groups:  herd, 15
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.3983     0.2312  -6.048 1.47e-09 ***
## period2      -0.9919     0.3032  -3.272 0.001068 ** 
## period3      -1.1282     0.3228  -3.495 0.000474 ***
## period4      -1.5797     0.4220  -3.743 0.000182 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr) perid2 perid3
## period2 -0.363              
## period3 -0.340  0.280       
## period4 -0.260  0.213  0.198