
Regression

Regression models explain the relationship between a response variable (\(Y\)) and a collection of explanatory variables (\(X\)). We use different types of regression models depending on the type of response variable we have:

  • Continuous response \(\longrightarrow\) linear regression
  • Count (with no maximum) response \(\longrightarrow\) Poisson regression
  • Count (with a maximum) response \(\longrightarrow\) logistic regression
  • Binary response \(\longrightarrow\) logistic regression

Linear regression (described below) assumes a normal (or Gaussian) distribution for the data (but really the errors). Poisson regression assumes a Poisson distribution for the data.
Logistic regression assumes a binomial distribution for the data (or Bernoulli when the response is binary).
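To make the mapping from response type to model concrete, here is a minimal sketch of the corresponding R calls. The data here are simulated just for illustration (none of these objects appear elsewhere on this page).

set.seed(1)
d <- data.frame(x = runif(50, 0, 10))
d$y_cont  <- rnorm(50, mean = 2 + 0.5 * d$x)                       # continuous response
d$y_count <- rpois(50, lambda = exp(0.2 + 0.1 * d$x))              # count with no maximum
d$y_max   <- rbinom(50, size = 10, prob = plogis(-1 + 0.3 * d$x))  # count out of 10 trials
d$y_bin   <- rbinom(50, size = 1,  prob = plogis(-1 + 0.3 * d$x))  # binary response

m_lin   <- lm(y_cont ~ x, data = d)                       # linear regression
m_pois  <- glm(y_count ~ x, family = poisson, data = d)   # Poisson regression
m_binom <- glm(cbind(y_max, 10 - y_max) ~ x,              # logistic (binomial) regression
               family = binomial, data = d)               #   for a count with a maximum
m_logit <- glm(y_bin ~ x, family = binomial, data = d)    # logistic regression for a binary response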

Regression models change a bit depending on the data type of the explanatory variables. The most natural setting is when the explanatory variable(s) are continuous, because then we can think about the regression model drawing a line through the data. We also have a lot of flexibility in how to include a continuous variable in the model, including taking its logarithm or squaring it. When the explanatory variables are categorical, we need to construct dummy (or indicator) variables to include in the regression model. In addition, if we have interaction terms (which we generally think of as products of explanatory variables), then interpretation of the regression model becomes more difficult and nuanced.
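To see what those dummy variables and interaction (product) terms look like, here is a small sketch using a made-up data frame (the names d and group are just for illustration).

d <- data.frame(x     = c(1, 2, 3, 4),
                group = factor(c("a", "a", "b", "b")))
model.matrix(~ x + group, data = d)  # column groupb is the 0/1 dummy variable for group
model.matrix(~ x * group, data = d)  # column x:groupb is the product of x and that dummy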

Simple linear regression

The simple linear regression (SLR) model expands the applicability of normal-based modeling by shifting the mean according to one explanatory variable. I use the term response variable (\(Y\)) for the variable being explained and explanatory variable (\(X\)) for the variable that explains changes in the response.

To run a simple linear regression model, we need to record both the response and explanatory variable for each observation, i.e. we have a pair \((Y_i,X_i)\) for all \(i=1,\ldots,n\). Then our model is \[Y_i \stackrel{ind}{\sim} N(\beta_0+\beta_1X_i,\sigma^2).\] Although this is a succinct way of writing the model, I think a more intuitive way to write the model is \[Y_i = \beta_0+\beta_1X_i+\epsilon_i, \qquad \epsilon_i\stackrel{ind}{\sim} N(0,\sigma^2).\] This notation more directly provides the 4 assumptions in a simple linear regression model:

  • independent errors
  • normal errors
  • errors have constant variance
  • linear relationship between explanatory variable (X) and the mean of the response variable

The last assumption is most straight-forward when you write \[E[Y_i] = \beta_0+\beta_1X_i.\]

Now we could go through all the formulas associated with the SLR model, but suffice it to say that we can construct confidence intervals and hypothesis tests for any \(\beta\), and these involve the \(T\)-distribution. Software generally defaults to providing hypothesis tests of the \(\beta\)s being 0. This makes sense for \(\beta_1\) since, if it is 0, then the explanatory variable has no effect on the response variable.
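As a quick illustration (with simulated data, not data used elsewhere on this page), R's default output gives exactly these \(T\)-based tests and intervals.

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
m <- lm(y ~ x)
summary(m)$coefficients  # estimates, standard errors, t values, p-values for H0: beta = 0
confint(m)               # 95% confidence intervals based on the T-distribution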

All the usual caveats apply for hypothesis tests here. Have you read the ASA Statement on \(p\)-Values yet?

Interpretation

The most useful aspect of SLR models is the simplicity of interpretation:

For each unit increase in the explanatory variable, the mean of the response variable increases by \(\beta_1\).

Multiple linear regression

Multiple (linear) regression is the extension of the SLR model to more than one continuous or binary explanatory variable. It also includes quadratic terms as well as interactions. Now, for each observation, you need to collect the value of the response variable as well as all of the explanatory variables, i.e. \((Y_i,X_{i,1},\ldots,X_{i,p})\) for all \(i=1,\ldots,n\). (I’m thinking spreadsheets might be pretty handy about now.)

The model can be written \[Y_i = \beta_0+\beta_1X_{i,1}+\ldots+\beta_p X_{i,p}+\epsilon_i, \qquad \epsilon_i\stackrel{ind}{\sim} N(0,\sigma^2).\] But it is much more succinct when you use some linear algebra \[Y = X\beta + \epsilon, \quad \epsilon \sim N(0,\sigma^2\mathrm{I})\] but now we have to define

  • the response vector \(Y = (Y_1,\ldots,Y_n)^{\top}\),
  • the coefficient vector \(\beta = (\beta_1,\ldots,\beta_p)^{\top}\),
  • the error vector \(\epsilon = (\epsilon_1,\ldots,\epsilon_n)^{\top}\), and
  • the model matrix \(X\), which is \(n \times p\), with \(i\)th row \(X_i = (X_{i,1},\ldots,X_{i,p})\).

Note: the models aren’t quite identical since the matrix version does not have an explicit intercept, but instead you can include an intercept by having the first column of X be all 1s.

Maximum likelihood estimator

Using some linear algebra and assuming we have a full rank \(X\), the maximum likelihood estimator for \(\beta\) is nice and succinct \[\hat\beta_{MLE} = (X^\top X)^{-1}X^\top y.\] How pretty is that!! (Break this out at parties and impress all your friends. Don’t write it down, just say it.)
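If you want to check the formula numerically, here is a sketch with simulated data (the object names are just for illustration) comparing it to what lm() reports.

set.seed(1)
x1 <- runif(20)
x2 <- runif(20)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(20)
X  <- model.matrix(~ x1 + x2)                  # first column is all 1s (the intercept)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))         # the two columns agree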

Interpretation

Multiple regression models are much better at representing reality than simple linear regression models because, generally,

  • there is more than one explanatory variable affecting the response variable and
  • relationships between explanatory variables and the response variable are complicated.

With this flexibility, we do lose some interpretability. If there are no interactions, then we can interpret the coefficient for the \(j\)th explanatory variable like this

holding all other explanatory variables constant, a one-unit increase in the \(j\)th explanatory variable increases the mean of the response variable by \(\beta_j\).

When you read about this in the newspaper or a journal article, they will often use the phrase “after adjusting for …” or “after controlling for …”.

When interactions are present, everything becomes much more complicated. Are you ready to use figures yet?

Visualization

In this section, we will visualize a bunch of regression models using the ggplot2 R package. First, we need to load the package

library("ggplot2")  # you may need to install it first using install.packages("ggplot2")

Also, the data we will use are in the Sleuth3 R package, so it also needs to be loaded.

library("Sleuth3")  # you may need to install it first using install.packages("Sleuth3")

Simple linear regression

We will start with the simple linear regression (SLR) model, but using logarithms we will show that there is a lot of flexibility even in this relatively trivial model. The simple in SLR indicates that there is a single explanatory variable and, typically, it is continuous.

ggplot(Sleuth3::case0801, 
       aes(x = Area, y = Species)) + 
  geom_point() +  
  labs(title = "Number of reptile and amphibian species for West Indies islands") +
  geom_smooth(method = "lm", formula = y ~ x)

Logarithm of explanatory variable

ggplot(Sleuth3::case0801, 
       aes(x = Area, y = Species)) + 
  geom_point() + 
  scale_x_log10() + 
  labs(title = "Number of reptile and amphibian species for West Indies islands") +
  geom_smooth(method = "lm", formula = y ~ x)

Logarithm of response variable

ggplot(Sleuth3::case0801, 
       aes(x = Area, y = Species)) + 
  geom_point() + 
  scale_y_log10() + 
  labs(title = "Number of reptile and amphibian species for West Indies islands") +
  geom_smooth(method = "lm", formula = y ~ x)

Logarithm of both response and explanatory variable

ggplot(Sleuth3::case0801, 
       aes(x = Area, y = Species)) + 
  geom_point() + 
  scale_x_log10() + 
  scale_y_log10() + 
  labs(title = "Number of reptile and amphibian species for West Indies islands") +
  geom_smooth(method = "lm", formula = y ~ x)

For good measure, let’s run the regression analysis in R

m <- lm(log(Species) ~ log(Area), data = Sleuth3::case0801)
summary(m)
## 
## Call:
## lm(formula = log(Species) ~ log(Area), data = Sleuth3::case0801)
## 
## Residuals:
##          1          2          3          4          5          6          7 
## -0.0021358  0.1769753 -0.2154872  0.0009468 -0.0292440  0.0595428  0.0094020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.93651    0.08813   21.97 3.62e-06 ***
## log(Area)    0.24968    0.01211   20.62 4.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1283 on 5 degrees of freedom
## Multiple R-squared:  0.9884, Adjusted R-squared:  0.9861 
## F-statistic: 425.3 on 1 and 5 DF,  p-value: 4.962e-06

Since Species is a count with no maximum, we may really want to run a Poisson regression analysis.

m <- glm(Species ~ log(Area),      
         data = Sleuth3::case0801,
         family = poisson)
summary(m)
## 
## Call:
## glm(formula = Species ~ log(Area), family = poisson, data = Sleuth3::case0801)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.44771   1.42096  -1.58787  -0.06253   0.08458   0.43416   0.27131  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.84158    0.20679   8.906   <2e-16 ***
## log(Area)    0.26251    0.02216  11.845   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 224.0801  on 6  degrees of freedom
## Residual deviance:   5.0141  on 5  degrees of freedom
## AIC: 46.119
## 
## Number of Fisher Scoring iterations: 4

SLR with binary explanatory variable

ggplot(Sleuth3::case0801, 
       aes(x = 1*I(Area>100), y = Species)) + 
  geom_point() + 
  scale_y_log10() + 
  labs(title = "Number of reptile and amphibian species for West Indies islands") +
  geom_smooth(method = "lm", formula = y ~ x)

Multiple regression

library("dplyr") # may need to install.packages("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
case0901 <- Sleuth3::case0901 %>%
  mutate(Time = ifelse(Time == 1, "Late", "Early"))
dim(case0901)
## [1] 24  3
summary(case0901)
##     Flowers          Time             Intensity  
##  Min.   :31.30   Length:24          Min.   :150  
##  1st Qu.:45.42   Class :character   1st Qu.:300  
##  Median :54.75   Mode  :character   Median :525  
##  Mean   :56.14                      Mean   :525  
##  3rd Qu.:64.45                      3rd Qu.:750  
##  Max.   :78.00                      Max.   :900
head(case0901)
##   Flowers Time Intensity
## 1    62.3 Late       150
## 2    77.4 Late       150
## 3    55.3 Late       300
## 4    54.2 Late       300
## 5    49.6 Late       450
## 6    61.9 Late       450

SLR

Let’s start by ignoring Time.

ggplot(case0901, 
       aes(x = Intensity, y = Flowers)) + 
  geom_point() + 
  labs(title = "Effects of Light on Meadowfoam Flowering") +
  geom_smooth(method = "lm", formula = y ~ x)

m <- lm(Flowers ~ Intensity, data = case0901)
summary(m)
## 
## Call:
## lm(formula = Flowers ~ Intensity, data = case0901)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7314  -7.8052   0.0186   6.1857  13.2686 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 77.385000   4.161186  18.597 6.06e-15 ***
## Intensity   -0.040471   0.007123  -5.682 1.03e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.94 on 22 degrees of freedom
## Multiple R-squared:  0.5947, Adjusted R-squared:  0.5763 
## F-statistic: 32.28 on 1 and 22 DF,  p-value: 1.03e-05

Multiple regression

Let’s add Time

ggplot(case0901, 
       aes(x = Intensity, y = Flowers, color = Time, shape = Time)) + 
  geom_point() + 
  labs(title = "Effects of Light on Meadowfoam Flowering") +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

By default, ggplot2::geom_smooth plots separate lines for the two Times and thus is equivalent to a model with an interaction between Time and Intensity. The fact that these lines look parallel indicates an interaction is not needed.

m <- lm(Flowers ~ Intensity*Time, data = case0901)
anova(m)
## Analysis of Variance Table
## 
## Response: Flowers
##                Df  Sum Sq Mean Sq F value    Pr(>F)    
## Intensity       1 2579.75 2579.75 59.2597 2.101e-07 ***
## Time            1  886.95  886.95 20.3742 0.0002119 ***
## Intensity:Time  1    0.58    0.58  0.0132 0.9095675    
## Residuals      20  870.66   43.53                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is not significant, so it will simplify our interpretation if we exclude it. (Please note, though, that you could keep it in the model.)

m <- lm(Flowers ~ Intensity+Time, data = case0901)
anova(m)
## Analysis of Variance Table
## 
## Response: Flowers
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Intensity  1 2579.75 2579.75  62.181 1.037e-07 ***
## Time       1  886.95  886.95  21.379 0.0001464 ***
## Residuals 21  871.24   41.49                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m)
## 
## Call:
## lm(formula = Flowers ~ Intensity + Time, data = case0901)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.652 -4.139 -1.558  5.632 12.165 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.464167   3.273772  25.495  < 2e-16 ***
## Intensity    -0.040471   0.005132  -7.886 1.04e-07 ***
## TimeLate    -12.158333   2.629557  -4.624 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.441 on 21 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:   0.78 
## F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08
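Since geom_smooth() fits a separate line for each Time (i.e. the interaction model), one way to draw the parallel lines implied by the additive model above is to plot its fitted values. This is a sketch in the spirit of the earlier plots, not output shown here.

case0901$fitted <- predict(m)  # m is the additive model Flowers ~ Intensity + Time
ggplot(case0901,
       aes(x = Intensity, y = Flowers, color = Time, shape = Time)) +
  geom_point() +
  geom_line(aes(y = fitted)) +
  labs(title = "Effects of Light on Meadowfoam Flowering (additive fit)")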