library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("Sleuth3")
library("ggResidpanel")
library("emmeans")
Let \(Y_i\) be a Poisson response variable. Assume \[ Y_i \stackrel{ind}{\sim} Po(\lambda_i), \quad \eta_i = \log\left(\lambda_i \right) = \beta_0 + \beta_1X_{i,1} + \cdots + \beta_{p-1} X_{i,p-1} \]
If we invert, then \[ E[Y_i] = e^{\beta_0 + \beta_1X_{i,1} + \cdots + \beta_{p-1} X_{i,p-1}}. \]
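A consequence of the log link is that coefficients act multiplicatively on the mean: a one-unit increase in \(X_{i,j}\) multiplies \(E[Y_i]\) by \(e^{\beta_j}\), no matter the values of the other covariates. A small base-R check of this (the coefficient values here are made up for illustration):

```r
# Hypothetical coefficients, chosen only to illustrate the log link
beta0 <- 0.5  # intercept
beta1 <- 0.2  # slope for a single covariate x

lambda <- function(x) exp(beta0 + beta1 * x)  # E[Y | x] under the log link

# Increasing x by 1 multiplies the mean by exp(beta1), regardless of x
ratio_at_0 <- lambda(1) / lambda(0)
ratio_at_5 <- lambda(6) / lambda(5)
all.equal(ratio_at_0, exp(beta1))  # TRUE
all.equal(ratio_at_0, ratio_at_5)  # TRUE
```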
To fit a Poisson regression model in R, use the glm() function with the argument family = "poisson".
summary(ex1509)
## Year Sunspots
## Min. :1700 Min. : 0.00
## 1st Qu.:1778 1st Qu.: 16.00
## Median :1855 Median : 40.00
## Mean :1855 Mean : 49.57
## 3rd Qu.:1932 3rd Qu.: 69.50
## Max. :2010 Max. :190.00
ggplot(ex1509, aes(x = Year, y = Sunspots)) +
geom_point() +
geom_smooth() # smoother, not lm
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Poisson regression is closer to an analysis of the logarithm of the response. With count data, however, some of the counts may be zero, and the logarithm of zero is undefined. A common workaround is to add 1 to every count before taking logs; here we plot the data as is because only 3 of the 311 counts are zero.
ggplot(ex1509, aes(x = Year, y = Sunspots)) +
geom_point() +
scale_y_log10() +
geom_smooth()
## Warning: Transformation introduced infinite values in continuous y-axis
## Transformation introduced infinite values in continuous y-axis
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (`stat_smooth()`).
These data are not the best demonstration of Poisson regression because there is a clear temporal pattern. With that being said, if you are just interested in the overall trend of sunspots through time, Poisson regression is still a reasonable approach.
m <- glm(Sunspots ~ Year,
data = ex1509,
family = poisson)
The maximum likelihood estimator for this model is not available in closed form, so an iterative algorithm is used to compute the estimates. Fortunately, glm() handles these details for us.
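For the curious, the iteration glm() uses (iteratively reweighted least squares, i.e. Fisher scoring) can be sketched in a few lines of base R. The simulated data and variable names below are illustrative only:

```r
set.seed(42)
n <- 200
x <- runif(n)
y <- rpois(n, exp(1 + 0.5 * x))   # simulate with true beta = (1, 0.5)
X <- cbind(1, x)                  # design matrix

# Iteratively reweighted least squares for the Poisson log link:
# weights are mu_i and the working response is eta_i + (y_i - mu_i)/mu_i
beta <- c(0, 0)
for (iter in 1:25) {
  eta  <- drop(X %*% beta)
  mu   <- exp(eta)
  z    <- eta + (y - mu) / mu
  beta <- solve(t(X) %*% (mu * X), t(X) %*% (mu * z))
}

# Should agree with glm() to numerical precision
fit <- glm(y ~ x, family = poisson)
max(abs(drop(beta) - coef(fit)))
```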
coef(m)
## (Intercept) Year
## 0.4108065 0.0018752
This MLE can be used to determine asymptotic confidence intervals.
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.078534908 0.742477234
## Year 0.001697973 0.002052615
The interval for the Year coefficient excludes zero, which provides evidence (if the remaining assumptions of the model hold) that the coefficient for Year is unlikely to be zero.
We can also compute p-values for a single parameter being zero or a collection of parameters being zero.
summary(m)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4108065 1.693740e-01 2.42544 1.528983e-02
## Year 0.0018752 9.047046e-05 20.72720 1.969002e-95
A small p-value here indicates evidence against the model that sets that coefficient to 0. But remember that other aspects of the model could be problematic, e.g. independence, constant rate (within a group), etc.
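The Wald p-value is just the estimate divided by its standard error, compared against a standard normal distribution. A quick check using the values printed above:

```r
# Wald z statistic and two-sided p-value for Year, from the summary above
est <- 0.0018752
se  <- 9.047046e-05
z   <- est / se            # about 20.73, matching the z value column
p   <- 2 * pnorm(-abs(z))  # matches Pr(>|z|) in the summary
c(z = z, p = p)
```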
The p-value above is based on a Wald test. An alternative is based on a likelihood ratio test.
anova(m, test = "LRT")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Sunspots
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 310 10072.0
## Year 1 433.27 309 9638.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
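The likelihood ratio test statistic is the drop in deviance when the term is added, compared against a chi-squared distribution on the corresponding degrees of freedom. Reproducing the table's p-value by hand:

```r
# Deviance drop from the table: 10072.0 - 9638.8 = 433.27 on 1 df
lrt  <- 433.27
pval <- pchisq(lrt, df = 1, lower.tail = FALSE)
pval  # far smaller than 2.2e-16, consistent with the table
```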
Unlike in linear regression (which is based on the normal distribution), the Wald and likelihood ratio p-values here do not match exactly.
When computing predictions, we need to be explicit about whether we want predictions on the linear predictor, i.e. \(\eta_i\), or on the response, i.e. \(\lambda_i = E[Y_i]\). By default, the predict() function provides predictions on the linear predictor.
p <- bind_cols(ex1509,
predict(m, newdata = ex1509, se.fit = TRUE)) %>%
mutate(
lcl = fit - qnorm(.975)*se.fit,
ucl = fit + qnorm(.975)*se.fit,
# convert to expected count
fit = exp(fit),
lcl = exp(lcl),
ucl = exp(ucl)
) %>%
select(-residual.scale)
head(p)
## Year Sunspots fit se.fit lcl ucl
## 1 1700 5 36.54872 0.01736342 35.32583 37.81394
## 2 1701 11 36.61732 0.01728332 35.39769 37.87897
## 3 1702 16 36.68605 0.01720332 35.46969 37.94412
## 4 1703 23 36.75491 0.01712343 35.54183 38.00938
## 5 1704 36 36.82390 0.01704364 35.61411 38.07477
## 6 1705 58 36.89301 0.01696397 35.68653 38.14028
ggplot(p, aes(x = Year,
y = Sunspots,
ymin = lcl,
ymax = ucl)) +
geom_point() +
geom_ribbon(fill = "blue", alpha = 0.5) +
geom_line(aes(y = fit), color = "blue")
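Alternatively, predict() can return the response scale directly via type = "response", which is equivalent to exponentiating the link-scale fit (though for intervals you still want to exponentiate the link-scale endpoints, as above). A check of this equivalence on simulated data, so it runs standalone:

```r
set.seed(1)
d   <- data.frame(x = runif(50))
d$y <- rpois(50, exp(1 + d$x))
fit <- glm(y ~ x, data = d, family = poisson)

on_link     <- predict(fit, type = "link")      # eta_i (the default)
on_response <- predict(fit, type = "response")  # lambda_i = exp(eta_i)

all.equal(exp(on_link), on_response)  # TRUE
```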
Another alternative is to use the emmeans::emmeans() function.
emmeans(m, pairwise ~ Year, type = "response",
at = list(Year = c(quantile(ex1509$Year, c(0,.5,1)))))
## $emmeans
## Year rate SE df asymp.LCL asymp.UCL
## 1700 36.5 0.635 Inf 35.3 37.8
## 1855 48.9 0.399 Inf 48.1 49.7
## 2010 65.4 0.981 Inf 63.5 67.3
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df null z.ratio p.value
## Year1700 / Year1855 0.748 0.0105 Inf 1 -20.727 <.0001
## Year1700 / Year2010 0.559 0.0157 Inf 1 -20.727 <.0001
## Year1855 / Year2010 0.748 0.0105 Inf 1 -20.727 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log scale
It is more difficult to assess model assumptions in Poisson regression models than in linear regression.
resid_panel(m)
resid_xpanel(m)
case0801
## Area Species
## 1 44218 100
## 2 29371 108
## 3 4244 45
## 4 3435 53
## 5 32 16
## 6 5 11
## 7 1 7
Due to the large ratio of maximum to minimum area, I would start with a logarithm of Area. Because the Poisson regression model uses a logarithmic link function, we should also plot Species on a log scale.
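A brief aside on why the log-log plot is the natural diagnostic here: with a log link and \(\log(\text{Area})\) as the covariate, the model implies a power law, \[ E[\text{Species}] = e^{\beta_0 + \beta_1 \log(\text{Area})} = e^{\beta_0}\,\text{Area}^{\beta_1}, \] so if the model is reasonable, plotting both axes on log scales should yield an approximately straight line.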
g <- ggplot(case0801, aes(x = Area, y = Species)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
g
Looks pretty linear.
m <- glm(Species ~ log(Area),
data = case0801,
family = "poisson")
summary(m)
##
## Call:
## glm(formula = Species ~ log(Area), family = "poisson", data = case0801)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## -0.44771 1.42096 -1.58787 -0.06253 0.08458 0.43416 0.27131
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.84158 0.20679 8.906 <2e-16 ***
## log(Area) 0.26251 0.02216 11.845 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 224.0801 on 6 degrees of freedom
## Residual deviance: 5.0141 on 5 degrees of freedom
## AIC: 46.119
##
## Number of Fisher Scoring iterations: 4
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.4166350 2.2285479
## log(Area) 0.2206257 0.3076378
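Since the link is logarithmic and the covariate is log(Area), the coefficient has a convenient multiplicative reading per doubling of area. A quick base-R computation using the estimate above:

```r
b1 <- 0.26251  # log(Area) coefficient from the summary above
2^b1           # multiplicative change in E[Species] per doubling of area
# about 1.20: each doubling of island area is associated with
# roughly a 20% increase in the expected number of species
```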
p <- bind_cols(case0801,
predict(m, se.fit = TRUE)) %>% # defaults to observed data
mutate(
lcl = fit - qnorm(.975)*se.fit,
ucl = fit + qnorm(.975)*se.fit,
fit = exp(fit),
lcl = exp(lcl),
ucl = exp(ucl)
)
g + geom_ribbon(data = p,
aes(
x = Area,
ymin = lcl,
ymax = ucl
),
fill = "blue",
alpha = 0.5) +
geom_line(data = p,
aes(
x = Area,
y = fit
), color = "blue") +
labs(
x = "Island Area (sq. miles)",
y = "Number of Species",
title = "Number of Reptile and Amphibian Species vs Island Area"
)
summary(warpbreaks)
## breaks wool tension
## Min. :10.00 A:27 L:18
## 1st Qu.:18.25 B:27 M:18
## Median :26.00 H:18
## Mean :28.15
## 3rd Qu.:34.00
## Max. :70.00
g <- ggplot(warpbreaks,
aes(x = tension,
y = breaks,
color = wool,
shape = wool)) +
geom_point(position = position_jitterdodge(dodge.width = 0.1)) +
scale_y_log10()
g
m <- glm(breaks ~ wool * tension,
data = warpbreaks,
family = "poisson")
resid_panel(m, qqbands = TRUE, smoother = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
anova(m, test = "LRT")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: breaks
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 53 297.37
## wool 1 16.039 52 281.33 6.206e-05 ***
## tension 2 70.942 50 210.39 3.938e-16 ***
## wool:tension 2 28.087 48 182.31 7.962e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
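One caution before interpreting these p-values: the residual deviance (182.31 on 48 degrees of freedom) is well above its degrees of freedom, which hints at overdispersion relative to the Poisson assumption. A hedged sketch of a quick check and one common remedy (warpbreaks ships with base R, so this runs standalone):

```r
# Refit the model so this chunk is self-contained
m <- glm(breaks ~ wool * tension, data = warpbreaks, family = poisson)

# Rough overdispersion check: residual deviance / residual df,
# which should be near 1 if the Poisson assumption holds
deviance(m) / df.residual(m)  # about 3.8, well above 1

# A quasipoisson fit keeps the same point estimates but inflates
# standard errors by sqrt(dispersion)
mq <- glm(breaks ~ wool * tension, data = warpbreaks, family = quasipoisson)
summary(mq)$dispersion
```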
nd <- warpbreaks %>%
select(wool, tension) %>%
unique()
p <- bind_cols(nd,
predict(m, nd, se.fit = TRUE)) %>%
mutate(
lcl = fit - qnorm(.975)*se.fit,
ucl = fit + qnorm(.975)*se.fit,
# Convert to expected count scale
fit = exp(fit),
lcl = exp(lcl),
ucl = exp(ucl)
)
g +
geom_pointrange(
data = p,
aes(y = fit,
ymin = lcl,
ymax = ucl),
position = position_dodge(width = 0.1)
) +
geom_line(
data = p,
aes(y = fit, group = wool)
) +
labs(
x = "Tension",
y = "Number of breaks",
title = "Number of Breaks vs Tension and Wool Type",
color = "Wool",
shape = "Wool"
)
tension <- emmeans(m, pairwise ~ tension, type = "response")
## NOTE: Results may be misleading due to involvement in interactions
confint(tension)
## $emmeans
## tension rate SE df asymp.LCL asymp.UCL
## L 35.5 1.42 Inf 32.8 38.4
## M 26.3 1.21 Inf 24.0 28.8
## H 21.5 1.10 Inf 19.4 23.7
##
## Results are averaged over the levels of: wool
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df asymp.LCL asymp.UCL
## L / M 1.35 0.0824 Inf 1.17 1.56
## L / H 1.65 0.1073 Inf 1.42 1.92
## M / H 1.22 0.0842 Inf 1.04 1.44
##
## Results are averaged over the levels of: wool
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
## Intervals are back-transformed from the log scale
wool <- emmeans(m, pairwise ~ wool, type = "response")
## NOTE: Results may be misleading due to involvement in interactions
confint(wool)
## $emmeans
## wool rate SE df asymp.LCL asymp.UCL
## A 29.7 1.069 Inf 27.7 31.9
## B 24.8 0.968 Inf 23.0 26.8
##
## Results are averaged over the levels of: tension
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df asymp.LCL asymp.UCL
## A / B 1.2 0.0636 Inf 1.08 1.33
##
## Results are averaged over the levels of: tension
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Perhaps you didn’t want to compare the number of breaks for one explanatory variable averaged over the levels of the other. Instead, you wanted to make that comparison separately at each level of the other explanatory variable.
tension <- emmeans(m, pairwise ~ tension | wool, type = "response")
confint(tension)
## $emmeans
## wool = A:
## tension rate SE df asymp.LCL asymp.UCL
## L 44.6 2.22 Inf 40.4 49.1
## M 24.0 1.63 Inf 21.0 27.4
## H 24.6 1.65 Inf 21.5 28.0
##
## wool = B:
## tension rate SE df asymp.LCL asymp.UCL
## L 28.2 1.77 Inf 25.0 31.9
## M 28.8 1.79 Inf 25.5 32.5
## H 18.8 1.44 Inf 16.1 21.8
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## wool = A:
## contrast ratio SE df asymp.LCL asymp.UCL
## L / M 1.856 0.1567 Inf 1.523 2.26
## L / H 1.814 0.1520 Inf 1.491 2.21
## M / H 0.977 0.0935 Inf 0.781 1.22
##
## wool = B:
## contrast ratio SE df asymp.LCL asymp.UCL
## L / M 0.981 0.0866 Inf 0.797 1.21
## L / H 1.503 0.1492 Inf 1.191 1.90
## M / H 1.533 0.1515 Inf 1.216 1.93
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
## Intervals are back-transformed from the log scale
# Compare wool
wool <- emmeans(m, pairwise ~ wool | tension, type = "response")
confint(wool)
## $emmeans
## tension = L:
## wool rate SE df asymp.LCL asymp.UCL
## A 44.6 2.22 Inf 40.4 49.1
## B 28.2 1.77 Inf 25.0 31.9
##
## tension = M:
## wool rate SE df asymp.LCL asymp.UCL
## A 24.0 1.63 Inf 21.0 27.4
## B 28.8 1.79 Inf 25.5 32.5
##
## tension = H:
## wool rate SE df asymp.LCL asymp.UCL
## A 24.6 1.65 Inf 21.5 28.0
## B 18.8 1.44 Inf 16.1 21.8
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## tension = L:
## contrast ratio SE df asymp.LCL asymp.UCL
## A / B 1.579 0.1266 Inf 1.349 1.847
##
## tension = M:
## contrast ratio SE df asymp.LCL asymp.UCL
## A / B 0.834 0.0768 Inf 0.696 0.999
##
## tension = H:
## contrast ratio SE df asymp.LCL asymp.UCL
## A / B 1.308 0.1336 Inf 1.070 1.598
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
# Get all the comparisons
both <- emmeans(m, pairwise ~ wool + tension, type = "response")
confint(both)
## $emmeans
## wool tension rate SE df asymp.LCL asymp.UCL
## A L 44.6 2.22 Inf 40.4 49.1
## B L 28.2 1.77 Inf 25.0 31.9
## A M 24.0 1.63 Inf 21.0 27.4
## B M 28.8 1.79 Inf 25.5 32.5
## A H 24.6 1.65 Inf 21.5 28.0
## B H 18.8 1.44 Inf 16.1 21.8
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df asymp.LCL asymp.UCL
## A L / B L 1.579 0.1266 Inf 1.256 1.98
## A L / A M 1.856 0.1567 Inf 1.460 2.36
## A L / B M 1.548 0.1234 Inf 1.234 1.94
## A L / A H 1.814 0.1520 Inf 1.429 2.30
## A L / B H 2.373 0.2176 Inf 1.827 3.08
## B L / A M 1.176 0.1088 Inf 0.903 1.53
## B L / B M 0.981 0.0866 Inf 0.763 1.26
## B L / A H 1.149 0.1057 Inf 0.884 1.49
## B L / B H 1.503 0.1492 Inf 1.133 1.99
## A M / B M 0.834 0.0768 Inf 0.641 1.08
## A M / A H 0.977 0.0935 Inf 0.744 1.28
## A M / B H 1.278 0.1313 Inf 0.954 1.71
## B M / A H 1.172 0.1073 Inf 0.903 1.52
## B M / B H 1.533 0.1515 Inf 1.156 2.03
## A H / B H 1.308 0.1336 Inf 0.977 1.75
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 6 estimates
## Intervals are back-transformed from the log scale
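The emmeans ratios are exponentiated contrasts of the model coefficients. For example, the A L / B L ratio above can be recovered directly from the fitted coefficients (warpbreaks ships with base R, so this runs standalone):

```r
# Refit the model so this chunk is self-contained
m <- glm(breaks ~ wool * tension, data = warpbreaks, family = poisson)

# With treatment coding, log-rate(A, L) = b0 and log-rate(B, L) = b0 + woolB,
# so the A L / B L rate ratio is exp(-woolB)
exp(-coef(m)[["woolB"]])  # about 1.579, matching the emmeans contrast
```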