Regression models explain the relationship between a response variable (\(Y\)) and a collection of explanatory variables (\(X\)). We use different types of regression models depending on the type of response variable we have:
- Linear regression (described below) assumes a normal (or Gaussian) distribution for the data (but really the errors).
- Poisson regression assumes a Poisson distribution for the data.
- Logistic regression assumes a binomial distribution for the data (or Bernoulli when the response is binary).
Regression models also change a bit depending on the data type of the explanatory variables. The most natural setting is when the explanatory variable(s) are continuous, because then we can think of the regression model as drawing a line through the data. But we have a lot of flexibility in how to include a continuous variable in the model, including taking its logarithm or squaring it. When the explanatory variables are categorical, we need to construct dummy (or indicator) variables to include in the regression model. In addition, if we have interaction terms (which we generally think of as products of explanatory variables), then interpretation of the regression model becomes more difficult and nuanced.
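To see how dummy variables arise in practice, here is a small sketch (the data frame is made up for illustration) using R’s `model.matrix`, which constructs the indicator columns automatically from a factor:

```r
# Toy data (made up for illustration): one categorical explanatory variable
d <- data.frame(group = factor(c("a", "b", "c", "a")))

# model.matrix() builds the design matrix: an intercept column plus
# dummy (indicator) columns; the first factor level ("a") is absorbed
# into the intercept by default
X <- model.matrix(~ group, data = d)
colnames(X)  # "(Intercept)", "groupb", "groupc"
```

Each row has a 1 in the column matching its factor level (other than the baseline level, which is represented by the intercept alone).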
The simple linear regression (SLR) model expands the applicability of normal-based modeling by shifting the mean according to one explanatory variable. I use the term response variable (\(Y\)) for the variable being explained and explanatory variable (\(X\)) for the variable that explains the change in the response variable.
To run a simple linear regression model, we need to record both the response and explanatory variable for each observation, i.e. we have a pair \((Y_i,X_i)\) for all \(i=1,\ldots,n\). Then our model is \[Y_i \stackrel{ind}{\sim} N(\beta_0+\beta_1X_i,\sigma^2).\] Although this is a succinct way of writing the model, I think a more intuitive way to write the model is \[Y_i = \beta_0+\beta_1X_i+\epsilon_i, \qquad \epsilon_i\stackrel{ind}{\sim} N(0,\sigma^2).\] This notation more directly provides the 4 assumptions in a simple linear regression model: the errors are independent, the errors are normally distributed, the errors have a constant variance, and the mean of the response is a linear function of the explanatory variable.
The last assumption is most straightforward when you write \[E[Y_i] = \beta_0+\beta_1X_i.\]
Now we could go through all the formulas associated with the SLR model, but suffice it to say that we can construct confidence intervals and hypothesis tests for any \(\beta\), and these involve the \(T\)-distribution. Software generally defaults to providing hypothesis tests of the \(\beta\)s being 0. This makes sense for \(\beta_1\) since, if it is 0, then the explanatory variable has no effect on the response variable.
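As a quick sketch of what that software output looks like (the data and parameter values here are simulated, not from any case study), `confint()` gives the \(t\)-based intervals and `summary()` gives the tests of \(\beta = 0\):

```r
# Simulated sketch (made-up data): t-based inference for the betas
set.seed(2024)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)   # true beta0 = 1, beta1 = 2
m <- lm(y ~ x)

confint(m, level = 0.95)     # t-based confidence intervals for beta0, beta1
summary(m)$coefficients      # t statistics and p-values for H0: beta = 0
```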
All the usual caveats apply for hypothesis tests here. Have you read the ASA Statement on \(p\)-Values yet?
The most useful aspect of SLR models is the simplicity of interpretation:
For each unit increase in the explanatory variable, the mean of the response variable increases by \(\beta_1\).
Multiple (linear) regression is the extension of the SLR model to more than one continuous or binary explanatory variable. It also includes quadratic terms as well as interactions. Now, for each observation, you need to collect the value of the response variable as well as all of the explanatory variables, i.e. \((Y_i,X_{i,1},\ldots,X_{i,p})\) for all \(i=1,\ldots,n\). (I’m thinking spreadsheets might be pretty handy about now.)
The model can be written \[Y_i = \beta_0+\beta_1X_{i,1}+\ldots+\beta_p X_{i,p}+\epsilon_i, \qquad \epsilon_i\stackrel{ind}{\sim} N(0,\sigma^2).\] But it is much more succinct when you use some linear algebra, \[Y = X\beta + e, \quad e \sim N(0,\sigma^2\mathrm{I}),\] where now \(Y\) is the \(n\)-vector of responses, \(X\) is the \(n\times p\) matrix of explanatory variables, \(\beta\) is the \(p\)-vector of coefficients, and \(e\) is the \(n\)-vector of errors.
Note: the models aren’t quite identical since the matrix version does not have an explicit intercept; instead, you can include an intercept by making the first column of \(X\) all 1s.
Using some linear algebra and assuming we have a full rank \(X\), the maximum likelihood estimator for \(\beta\) is nice and succinct \[\hat\beta_{MLE} = (X^\top X)^{-1}X^\top y.\] How pretty is that!! (Break this out at parties and impress all your friends. Don’t write it down, just say it.)
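The formula is also easy to check numerically. A sketch on simulated data (the design matrix and coefficients below are arbitrary), comparing the closed form against what `lm()` computes:

```r
# Sketch: verify the closed-form MLE against lm() on made-up data
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))   # first column of 1s plays the intercept
beta <- c(1, 2, -3)
y <- drop(X %*% beta + rnorm(n))

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
# lm() with "- 1" drops its own intercept since X already contains one;
# the two sets of estimates agree (lm uses a QR decomposition internally)
coef(lm(y ~ X - 1))
```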
Multiple regression models are much better at representing reality than simple linear regression models because, generally, the mean of the response variable depends on more than one explanatory variable.
With this flexibility, we do lose some interpretability. If there are no interactions, then we can interpret the coefficient for the \(j\)th explanatory variable like this
holding all other variables constant, a one-unit increase in the \(j\)th explanatory variable increases the mean of the response variable by \(\beta_j\).
When you read about this in the newspaper or a journal article, they will often use the phrase “after adjusting for …” or “after controlling for …”.
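The “holding all other variables constant” interpretation can be verified numerically. A hedged sketch on made-up data with two predictors: moving one predictor by a unit while the other stays fixed shifts the fitted mean by exactly that predictor’s coefficient.

```r
# Sketch (simulated data): the "holding other variables constant" reading
set.seed(9)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
m  <- lm(y ~ x1 + x2)

# Increase x1 by one unit while x2 is held at 0.5:
p <- predict(m, newdata = data.frame(x1 = c(0, 1), x2 = c(0.5, 0.5)))
unname(diff(p))   # equals coef(m)["x1"], the adjusted effect of x1
```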
When interactions are present, everything becomes much more complicated. Are you ready to use figures yet?
In this section, we will visualize a bunch of regression models using the ggplot2 R package. First, we need to load the package
library("ggplot2") # you may need to install it first using install.packages("ggplot2")
Also, the data we will use is in the Sleuth3 R package, so it also needs to be loaded.
library("Sleuth3") # you may need to install it first using install.packages("Sleuth3")
We will start with the simple linear regression (SLR) model, but using logarithms we will show that there is a lot of flexibility even in this relatively trivial model. The “simple” in SLR indicates that there is a single explanatory variable and, typically, it is continuous.
ggplot(Sleuth3::case0801,
aes(x = Area, y = Species)) +
geom_point() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)
ggplot(Sleuth3::case0801,
aes(x = Area, y = Species)) +
geom_point() +
scale_x_log10() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)
ggplot(Sleuth3::case0801,
aes(x = Area, y = Species)) +
geom_point() +
scale_y_log10() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)
ggplot(Sleuth3::case0801,
aes(x = Area, y = Species)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)
For good measure, let’s run the regression analysis in R
m <- lm(log(Species) ~ log(Area), data = Sleuth3::case0801)
summary(m)
##
## Call:
## lm(formula = log(Species) ~ log(Area), data = Sleuth3::case0801)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.0021358 0.1769753 -0.2154872 0.0009468 -0.0292440 0.0595428 0.0094020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.93651 0.08813 21.97 3.62e-06 ***
## log(Area) 0.24968 0.01211 20.62 4.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1283 on 5 degrees of freedom
## Multiple R-squared: 0.9884, Adjusted R-squared: 0.9861
## F-statistic: 425.3 on 1 and 5 DF, p-value: 4.962e-06
Since Species is a count with no maximum, we may really want to run a Poisson regression analysis.
m <- glm(Species ~ log(Area),
data = Sleuth3::case0801,
family = poisson)
summary(m)
##
## Call:
## glm(formula = Species ~ log(Area), family = poisson, data = Sleuth3::case0801)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## -0.44771 1.42096 -1.58787 -0.06253 0.08458 0.43416 0.27131
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.84158 0.20679 8.906 <2e-16 ***
## log(Area) 0.26251 0.02216 11.845 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 224.0801 on 6 degrees of freedom
## Residual deviance: 5.0141 on 5 degrees of freedom
## AIC: 46.119
##
## Number of Fisher Scoring iterations: 4
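Because Poisson regression uses a log link, the coefficients act multiplicatively on the mean count, so a common follow-up (not shown in the output above) is to exponentiate them. A hedged sketch on simulated counts, not the case0801 data:

```r
# Simulated sketch (made-up data): interpreting Poisson coefficients
set.seed(1)
n <- 200
x <- runif(n, 0, 3)
lambda <- exp(1 + 0.5 * x)   # mean count: log(lambda) = 1 + 0.5 x
y <- rpois(n, lambda)

m <- glm(y ~ x, family = poisson)
# exp(slope) is the multiplicative change in the mean count
# per unit increase in x (here, near exp(0.5))
exp(coef(m))
```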
ggplot(Sleuth3::case0801,
aes(x = 1*I(Area>100), y = Species)) +
geom_point() +
scale_y_log10() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)
library("dplyr") # may need to install.packages("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
case0901 <- Sleuth3::case0901 %>%
mutate(Time = ifelse(Time == 1, "Late", "Early"))
dim(case0901)
## [1] 24 3
summary(case0901)
## Flowers Time Intensity
## Min. :31.30 Length:24 Min. :150
## 1st Qu.:45.42 Class :character 1st Qu.:300
## Median :54.75 Mode :character Median :525
## Mean :56.14 Mean :525
## 3rd Qu.:64.45 3rd Qu.:750
## Max. :78.00 Max. :900
head(case0901)
## Flowers Time Intensity
## 1 62.3 Late 150
## 2 77.4 Late 150
## 3 55.3 Late 300
## 4 54.2 Late 300
## 5 49.6 Late 450
## 6 61.9 Late 450
Let’s start by ignoring Time.
ggplot(case0901,
aes(x = Intensity, y = Flowers)) +
geom_point() +
labs(title = "Effects of Light on Meadowfoam Flowering") +
geom_smooth(method = "lm", formula = y ~ x)
m <- lm(Flowers ~ Intensity, data = case0901)
summary(m)
##
## Call:
## lm(formula = Flowers ~ Intensity, data = case0901)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7314 -7.8052 0.0186 6.1857 13.2686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.385000 4.161186 18.597 6.06e-15 ***
## Intensity -0.040471 0.007123 -5.682 1.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.94 on 22 degrees of freedom
## Multiple R-squared: 0.5947, Adjusted R-squared: 0.5763
## F-statistic: 32.28 on 1 and 22 DF, p-value: 1.03e-05
Let’s add Time to the plot.
ggplot(case0901,
aes(x = Intensity, y = Flowers, color = Time, shape = Time)) +
geom_point() +
labs(title = "Effects of Light on Meadowfoam Flowering") +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
By default, ggplot2::geom_smooth plots separate lines for the two levels of Time and thus is equivalent to a model with an interaction between Time and Intensity. The fact that these lines look parallel indicates an interaction is not needed.
m <- lm(Flowers ~ Intensity*Time, data = case0901)
anova(m)
## Analysis of Variance Table
##
## Response: Flowers
## Df Sum Sq Mean Sq F value Pr(>F)
## Intensity 1 2579.75 2579.75 59.2597 2.101e-07 ***
## Time 1 886.95 886.95 20.3742 0.0002119 ***
## Intensity:Time 1 0.58 0.58 0.0132 0.9095675
## Residuals 20 870.66 43.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction is not significant, so excluding it will simplify our interpretation. (Please note, though, that you could keep it in the model.)
m <- lm(Flowers ~ Intensity+Time, data = case0901)
anova(m)
## Analysis of Variance Table
##
## Response: Flowers
## Df Sum Sq Mean Sq F value Pr(>F)
## Intensity 1 2579.75 2579.75 62.181 1.037e-07 ***
## Time 1 886.95 886.95 21.379 0.0001464 ***
## Residuals 21 871.24 41.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m)
##
## Call:
## lm(formula = Flowers ~ Intensity + Time, data = case0901)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.652 -4.139 -1.558 5.632 12.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.464167 3.273772 25.495 < 2e-16 ***
## Intensity -0.040471 0.005132 -7.886 1.04e-07 ***
## TimeLate -12.158333 2.629557 -4.624 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
## F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08