R code

library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("Sleuth3")
library("ggResidpanel")
library("emmeans")

Poisson Regression

Model

Let \(Y_i\) be a Poisson response variable. Assume \[ Y_i \stackrel{ind}{\sim} Po(\lambda_i), \quad \eta_i = \log\left(\lambda_i \right) = \beta_0 + \beta_1X_{i,1} + \cdots + \beta_{p-1} X_{i,p-1} \]

If we invert, then \[ E[Y_i] = e^{\beta_0 + \beta_1X_{i,1} + \cdots + \beta_{p-1} X_{i,p-1}}. \]
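Each coefficient therefore has a multiplicative interpretation: holding the other explanatory variables fixed, increasing \(X_{i,1}\) by one unit multiplies the expected count by \(e^{\beta_1}\), since \[ \frac{E[Y_i \mid X_{i,1} = x + 1]}{E[Y_i \mid X_{i,1} = x]} = \frac{e^{\beta_0 + \beta_1(x+1) + \cdots}}{e^{\beta_0 + \beta_1 x + \cdots}} = e^{\beta_1}. \]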

glm()

To fit a Poisson regression model in R, use the glm() function with the argument family = "poisson" (or, equivalently, family = poisson).
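A minimal sketch of the call, using placeholder names y, x, and d that are not part of these data:

# placeholder response y and predictor x in a data frame d
glm(y ~ x, data = d, family = poisson)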

summary(ex1509)
##       Year         Sunspots     
##  Min.   :1700   Min.   :  0.00  
##  1st Qu.:1778   1st Qu.: 16.00  
##  Median :1855   Median : 40.00  
##  Mean   :1855   Mean   : 49.57  
##  3rd Qu.:1932   3rd Qu.: 69.50  
##  Max.   :2010   Max.   :190.00
ggplot(ex1509, aes(x = Year, y = Sunspots)) + 
  geom_point() + 
  geom_smooth() # smoother, not lm
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Poisson regression is more analogous to an analysis of the logarithm of the response. With count data, however, some of the counts may be zero and their logarithm is undefined. A common approach is to add 1 to every count before taking logarithms, but here we plot the data as they are (without the +1) because only 3 of the 311 counts are zero.

ggplot(ex1509, aes(x = Year, y = Sunspots)) + 
  geom_point() + 
  scale_y_log10() + 
  geom_smooth()
## Warning: Transformation introduced infinite values in continuous y-axis
## Transformation introduced infinite values in continuous y-axis
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (`stat_smooth()`).
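The warnings above come from those 3 zero counts, which become \(-\infty\) on the log scale and are dropped. A sketch of the add-1 alternative mentioned above:

# shift counts by 1 so the zeros survive the log transformation
ggplot(ex1509, aes(x = Year, y = Sunspots + 1)) + 
  geom_point() + 
  scale_y_log10() + 
  geom_smooth()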

These data are not the best demonstration of Poisson regression because there is a clear cyclic temporal pattern, which calls the independence assumption into question. That being said, if you are only interested in the overall trend in sunspots over time, Poisson regression is still a reasonable approach.

m <- glm(Sunspots ~ Year, 
         data = ex1509,
         family = poisson)

Parameter estimation

The maximum likelihood estimator for this model is not available in closed form, so an iterative algorithm (Fisher scoring, a form of iteratively reweighted least squares) is used to find the estimates. Fortunately, glm() takes care of these details for us.

coef(m)
## (Intercept)        Year 
##   0.4108065   0.0018752
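As a sketch of what glm() is maximizing, we can write the Poisson log-likelihood ourselves and confirm that, evaluated at the estimates, it agrees with logLik(m):

# negative Poisson log-likelihood as a function of (beta0, beta1)
nll <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * ex1509$Year)
  -sum(dpois(ex1509$Sunspots, lambda, log = TRUE))
}
-nll(coef(m))  # should agree with logLik(m)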

These MLEs can be used to construct asymptotic confidence intervals.

confint(m)
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) 0.078534908 0.742477234
## Year        0.001697973 0.002052615
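The "Waiting for profiling" message indicates these are profile likelihood intervals. Wald-style intervals based on the asymptotic normality of the MLE are also available and should be similar here:

# Wald intervals: estimate +/- normal quantile * standard error
confint.default(m)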

The interval for the Year coefficient provides evidence that (if the remaining assumptions in the model are true) the coefficient for Year is unlikely to be zero.

Testing

We can also compute p-values for the hypothesis that a single parameter is zero or that a collection of parameters is all zero.

summary(m)$coefficients
##              Estimate   Std. Error  z value     Pr(>|z|)
## (Intercept) 0.4108065 1.693740e-01  2.42544 1.528983e-02
## Year        0.0018752 9.047046e-05 20.72720 1.969002e-95

A small p-value here indicates evidence against the model that has that coefficient equal to 0. But remember that other aspects of the model could be problematic, e.g. independence, a constant rate (within a group), etc.

The p-value above is based on a Wald test. An alternative is based on a likelihood ratio test.

anova(m, test = "LRT")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Sunspots
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   310    10072.0              
## Year  1   433.27       309     9638.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
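The likelihood ratio statistic in this table is simply the drop in deviance when Year is added. A sketch reproducing it by hand, using an intercept-only fit (called m0 here):

m0  <- glm(Sunspots ~ 1, data = ex1509, family = poisson)  # null model
lrt <- deviance(m0) - deviance(m)                          # drop in deviance, about 433
pchisq(lrt, df = 1, lower.tail = FALSE)                    # matches Pr(>Chi) above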

Unlike in linear regression (which is based on the normal distribution), the Wald and likelihood ratio p-values here do not match exactly.

Predictions

When computing predictions, we need to be explicit about whether we want predictions on the linear predictor scale, i.e. \(\eta_i\), or on the response scale, i.e. \(\lambda_i = E[Y_i]\). By default, the predict() function returns predictions on the linear predictor scale.
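If you only need the expected counts, predict() with type = "response" returns \(\hat\lambda_i\) directly; here, though, we build intervals on the linear predictor scale and then exponentiate them.

# expected counts on the response scale (no intervals)
head(predict(m, type = "response"))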

p <- bind_cols(ex1509, 
               predict(m, newdata = ex1509, se.fit = TRUE)) %>%
  mutate(
    lcl = fit - qnorm(.975)*se.fit,
    ucl = fit + qnorm(.975)*se.fit,
    
    # convert to expected count
    fit = exp(fit),
    lcl = exp(lcl),
    ucl = exp(ucl)
  ) %>%
  select(-residual.scale)

head(p)
##   Year Sunspots      fit     se.fit      lcl      ucl
## 1 1700        5 36.54872 0.01736342 35.32583 37.81394
## 2 1701       11 36.61732 0.01728332 35.39769 37.87897
## 3 1702       16 36.68605 0.01720332 35.46969 37.94412
## 4 1703       23 36.75491 0.01712343 35.54183 38.00938
## 5 1704       36 36.82390 0.01704364 35.61411 38.07477
## 6 1705       58 36.89301 0.01696397 35.68653 38.14028
ggplot(p, aes(x = Year, 
              y = Sunspots,
              ymin = lcl,
              ymax = ucl)) + 
  geom_point() + 
  geom_ribbon(fill = "blue", alpha = 0.5) +
  geom_line(aes(y = fit), color = "blue")

emmeans()

An alternative is to use the emmeans::emmeans() function, here evaluated at the minimum, median, and maximum Year.

emmeans(m, pairwise ~ Year, type = "response", 
        at = list(Year = c(quantile(ex1509$Year, c(0,.5,1)))))
## $emmeans
##  Year rate    SE  df asymp.LCL asymp.UCL
##  1700 36.5 0.635 Inf      35.3      37.8
##  1855 48.9 0.399 Inf      48.1      49.7
##  2010 65.4 0.981 Inf      63.5      67.3
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast            ratio     SE  df null z.ratio p.value
##  Year1700 / Year1855 0.748 0.0105 Inf    1 -20.727  <.0001
##  Year1700 / Year2010 0.559 0.0157 Inf    1 -20.727  <.0001
##  Year1855 / Year2010 0.748 0.0105 Inf    1 -20.727  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## Tests are performed on the log scale
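These ratios follow directly from the Year coefficient; for example, the 1700 vs 1855 comparison is the coefficient scaled by the 155-year gap:

exp(coef(m)["Year"] * (1700 - 1855))  # approximately 0.748, as in the contrast table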

Diagnostics

As with logistic regression, it is more difficult to assess model assumptions in Poisson regression models than in normal linear regression, but residual plots can still be informative.

resid_panel(m)

resid_xpanel(m)
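One simple numerical check is for overdispersion: under the Poisson model the residual deviance should be roughly comparable to its residual degrees of freedom, so a ratio much larger than 1 suggests extra-Poisson variation. A sketch:

# rough overdispersion check; values near 1 are consistent with the Poisson assumption
deviance(m) / df.residual(m)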

Examples

Number of species

case0801
##    Area Species
## 1 44218     100
## 2 29371     108
## 3  4244      45
## 4  3435      53
## 5    32      16
## 6     5      11
## 7     1       7

Due to the large ratio of the maximum to the minimum Area, I would start with a logarithm of Area. Because we are considering a Poisson regression model, which uses a logarithmic link function, we should also plot Species on a log scale.

g <- ggplot(case0801, aes(x = Area, y = Species)) + 
  geom_point() +
  scale_x_log10() + 
  scale_y_log10()

g

Looks pretty linear.

m <- glm(Species ~ log(Area), 
         data = case0801,
         family = "poisson")
summary(m)
## 
## Call:
## glm(formula = Species ~ log(Area), family = "poisson", data = case0801)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.44771   1.42096  -1.58787  -0.06253   0.08458   0.43416   0.27131  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.84158    0.20679   8.906   <2e-16 ***
## log(Area)    0.26251    0.02216  11.845   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 224.0801  on 6  degrees of freedom
## Residual deviance:   5.0141  on 5  degrees of freedom
## AIC: 46.119
## 
## Number of Fisher Scoring iterations: 4
confint(m)
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 1.4166350 2.2285479
## log(Area)   0.2206257 0.3076378
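Because Area enters through its logarithm, the coefficient describes the effect of multiplicative changes in Area; for example, doubling the island area is estimated to multiply the expected number of species by roughly 1.2:

exp(coef(m)["log(Area)"] * log(2))  # estimated effect of doubling Area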
p <- bind_cols(case0801,
               predict(m, se.fit = TRUE)) %>% # defaults to observed data
  mutate(
    lcl = fit - qnorm(.975)*se.fit,
    ucl = fit + qnorm(.975)*se.fit,
    
    fit = exp(fit),
    lcl = exp(lcl),
    ucl = exp(ucl)
  )
g + geom_ribbon(data = p, 
              aes(
                x = Area,
                ymin = lcl,
                ymax = ucl
              ),
              fill = "blue",
              alpha = 0.5) +
  geom_line(data = p,
            aes(
              x = Area,
              y = fit
            ), color = "blue") +
  labs(
    x = "Island Area (sq. miles)",
    y = "Number of Species",
    title = "Number of Reptile and Amphibian Species v Island Area"
  )

Warpbreaks

summary(warpbreaks)
##      breaks      wool   tension
##  Min.   :10.00   A:27   L:18   
##  1st Qu.:18.25   B:27   M:18   
##  Median :26.00          H:18   
##  Mean   :28.15                 
##  3rd Qu.:34.00                 
##  Max.   :70.00
g <- ggplot(warpbreaks, 
       aes(x = tension,
           y = breaks,
           color = wool,
           shape = wool)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.1)) +
  scale_y_log10()
g

m <- glm(breaks ~ wool * tension,
         data = warpbreaks,
         family = "poisson")
resid_panel(m, qqbands = TRUE, smoother = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

anova(m, test = "LRT")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: breaks
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            53     297.37              
## wool          1   16.039        52     281.33 6.206e-05 ***
## tension       2   70.942        50     210.39 3.938e-16 ***
## wool:tension  2   28.087        48     182.31 7.962e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
nd <- warpbreaks %>%
  select(wool, tension) %>%
  unique()

p <- bind_cols(nd, 
               predict(m, nd, se.fit = TRUE)) %>%
  mutate(
    lcl = fit - qnorm(.975)*se.fit,
    ucl = fit + qnorm(.975)*se.fit,
    
    # Convert to expected count scale
    fit = exp(fit),
    lcl = exp(lcl),
    ucl = exp(ucl)
  )

g + 
  geom_pointrange(
    data = p,
    aes(y = fit,
        ymin = lcl,
        ymax = ucl),
    position = position_dodge(width = 0.1)
  ) + 
  geom_line(
    data = p,
    aes(y = fit, group = wool)
  ) +
  labs(
    x = "Tension",
    y = "Number of breaks",
    title = "Number of Breaks vs Tension and Wool Type",
    color = "Wool",
    shape = "Wool"
  )

tension <- emmeans(m, pairwise ~ tension, type = "response")
## NOTE: Results may be misleading due to involvement in interactions
confint(tension)
## $emmeans
##  tension rate   SE  df asymp.LCL asymp.UCL
##  L       35.5 1.42 Inf      32.8      38.4
##  M       26.3 1.21 Inf      24.0      28.8
##  H       21.5 1.10 Inf      19.4      23.7
## 
## Results are averaged over the levels of: wool 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  L / M     1.35 0.0824 Inf      1.17      1.56
##  L / H     1.65 0.1073 Inf      1.42      1.92
##  M / H     1.22 0.0842 Inf      1.04      1.44
## 
## Results are averaged over the levels of: wool 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates 
## Intervals are back-transformed from the log scale
wool    <- emmeans(m, pairwise ~ wool,    type = "response")
## NOTE: Results may be misleading due to involvement in interactions
confint(wool)
## $emmeans
##  wool rate    SE  df asymp.LCL asymp.UCL
##  A    29.7 1.069 Inf      27.7      31.9
##  B    24.8 0.968 Inf      23.0      26.8
## 
## Results are averaged over the levels of: tension 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  A / B      1.2 0.0636 Inf      1.08      1.33
## 
## Results are averaged over the levels of: tension 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale

Perhaps you did not want to compare the number of breaks for one explanatory variable averaged over the levels of the other explanatory variable. Instead, you may want to compare the number of breaks for one explanatory variable within each level of the other explanatory variable.

tension <- emmeans(m, pairwise ~ tension | wool, type = "response")
confint(tension)
## $emmeans
## wool = A:
##  tension rate   SE  df asymp.LCL asymp.UCL
##  L       44.6 2.22 Inf      40.4      49.1
##  M       24.0 1.63 Inf      21.0      27.4
##  H       24.6 1.65 Inf      21.5      28.0
## 
## wool = B:
##  tension rate   SE  df asymp.LCL asymp.UCL
##  L       28.2 1.77 Inf      25.0      31.9
##  M       28.8 1.79 Inf      25.5      32.5
##  H       18.8 1.44 Inf      16.1      21.8
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
## wool = A:
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  L / M    1.856 0.1567 Inf     1.523      2.26
##  L / H    1.814 0.1520 Inf     1.491      2.21
##  M / H    0.977 0.0935 Inf     0.781      1.22
## 
## wool = B:
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  L / M    0.981 0.0866 Inf     0.797      1.21
##  L / H    1.503 0.1492 Inf     1.191      1.90
##  M / H    1.533 0.1515 Inf     1.216      1.93
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates 
## Intervals are back-transformed from the log scale
# Compare wool 
wool    <- emmeans(m, pairwise ~ wool | tension, type = "response")
confint(wool)
## $emmeans
## tension = L:
##  wool rate   SE  df asymp.LCL asymp.UCL
##  A    44.6 2.22 Inf      40.4      49.1
##  B    28.2 1.77 Inf      25.0      31.9
## 
## tension = M:
##  wool rate   SE  df asymp.LCL asymp.UCL
##  A    24.0 1.63 Inf      21.0      27.4
##  B    28.8 1.79 Inf      25.5      32.5
## 
## tension = H:
##  wool rate   SE  df asymp.LCL asymp.UCL
##  A    24.6 1.65 Inf      21.5      28.0
##  B    18.8 1.44 Inf      16.1      21.8
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
## tension = L:
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  A / B    1.579 0.1266 Inf     1.349     1.847
## 
## tension = M:
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  A / B    0.834 0.0768 Inf     0.696     0.999
## 
## tension = H:
##  contrast ratio     SE  df asymp.LCL asymp.UCL
##  A / B    1.308 0.1336 Inf     1.070     1.598
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale
# Get all the comparisons
both    <- emmeans(m, pairwise ~ wool + tension, type = "response")
confint(both)
## $emmeans
##  wool tension rate   SE  df asymp.LCL asymp.UCL
##  A    L       44.6 2.22 Inf      40.4      49.1
##  B    L       28.2 1.77 Inf      25.0      31.9
##  A    M       24.0 1.63 Inf      21.0      27.4
##  B    M       28.8 1.79 Inf      25.5      32.5
##  A    H       24.6 1.65 Inf      21.5      28.0
##  B    H       18.8 1.44 Inf      16.1      21.8
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## 
## $contrasts
##  contrast  ratio     SE  df asymp.LCL asymp.UCL
##  A L / B L 1.579 0.1266 Inf     1.256      1.98
##  A L / A M 1.856 0.1567 Inf     1.460      2.36
##  A L / B M 1.548 0.1234 Inf     1.234      1.94
##  A L / A H 1.814 0.1520 Inf     1.429      2.30
##  A L / B H 2.373 0.2176 Inf     1.827      3.08
##  B L / A M 1.176 0.1088 Inf     0.903      1.53
##  B L / B M 0.981 0.0866 Inf     0.763      1.26
##  B L / A H 1.149 0.1057 Inf     0.884      1.49
##  B L / B H 1.503 0.1492 Inf     1.133      1.99
##  A M / B M 0.834 0.0768 Inf     0.641      1.08
##  A M / A H 0.977 0.0935 Inf     0.744      1.28
##  A M / B H 1.278 0.1313 Inf     0.954      1.71
##  B M / A H 1.172 0.1073 Inf     0.903      1.52
##  B M / B H 1.533 0.1515 Inf     1.156      2.03
##  A H / B H 1.308 0.1336 Inf     0.977      1.75
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 6 estimates 
## Intervals are back-transformed from the log scale