Regression models explain the relationship between a response variable (\(Y\)) and a collection of explanatory variables (\(X\)). We use different types of regression models depending on the type of response variable we have:
Linear regression (described below) assumes a normal (or Gaussian) distribution for the data (but really the errors).
Poisson regression assumes a Poisson distribution for the data.
Logistic regression assumes a binomial distribution for the data (or Bernoulli when the response is binary).
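To make the distributional assumption concrete, here is a minimal sketch (not from the text above) of how these models are fit in R with the built-in glm() function; the data frame d and its columns y and x are hypothetical and simulated purely for illustration.
d <- data.frame(x = 1:20)
d$y <- rpois(20, lambda = exp(0.1 * d$x))          # simulated count response
m_pois <- glm(y ~ x, data = d, family = poisson)   # Poisson regression
summary(m_pois)
# For a binary response z, logistic regression would use
# glm(z ~ x, data = d, family = binomial); lm() fits the normal/linear model.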
Regression models change a bit depending on the data type of the explanatory variables. The most natural setting is when the explanatory variable(s) are continuous, because then we can think about the regression model as drawing a line through the data. But we have a lot of flexibility in how to include a continuous variable in the model, including taking the logarithm of the variable or squaring it. When the explanatory variables are categorical, we need to construct dummy (or indicator) variables to include in the regression model. In addition, if we have interaction terms (which we generally think of as multiplying explanatory variables), then interpretation of the regression model becomes more difficult and nuanced. A small sketch of these options appears below.
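Here is a brief sketch (hypothetical data frame d with columns y, x, and a categorical group, simulated purely for illustration) of how a logarithm, a squared term, or a categorical variable can be included in an R model formula:
d <- data.frame(x = runif(30, 1, 10),
                group = factor(rep(c("a", "b", "c"), 10)))
d$y <- 2 + 3 * log(d$x) + rnorm(30)
m_log    <- lm(y ~ log(x), data = d)       # logarithm of the explanatory variable
m_square <- lm(y ~ x + I(x^2), data = d)   # squared term via I()
m_group  <- lm(y ~ group, data = d)        # dummy (indicator) variables built automatically from the factor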
The simple linear regression (SLR) model expands the applicability of normal-based modeling by shifting the mean according to one explanatory variable. I use the term response variable (\(Y\)) for the variable being explained and the term explanatory variable (\(X\)) for the variable that explains the change in the response variable.
To run a simple linear regression model, we need to record both the response and explanatory variable for each observation, i.e. we have a pair \((Y_i,X_i)\) for all \(i=1,\ldots,n\). Then our model is \[Y_i \stackrel{ind}{\sim} N(\beta_0+\beta_1X_i,\sigma^2).\] Although this is a succinct way of writing the model, I think a more intuitive way to write the model is \[Y_i = \beta_0+\beta_1X_i+\epsilon_i, \qquad \epsilon_i\stackrel{ind}{\sim} N(0,\sigma^2).\] This notation more directly provides the 4 assumptions in a simple linear regression model:
the errors are normally distributed,
the errors are independent,
the errors have a constant variance (\(\sigma^2\)), and
the mean of the response is a linear function of the explanatory variable.
The last assumption is most straight-forward when you write \[E[Y_i] = \beta_0+\beta_1X_i.\]
Now we could go through all the formulas associated with the SLR model, but suffice it to say that we can construct confidence intervals and hypothesis tests for any \(\beta\), and these involve the \(T\)-distribution. Software generally defaults to providing hypothesis tests of the \(\beta\)s being 0. This makes sense for \(\beta_1\) since, if it is 0, then the explanatory variable has no effect on the response variable.
All the usual caveats apply for hypothesis tests here. Have you read the ASA Statement on \(p\)-Values yet?
The most useful aspect of SLR models is the simplicity of interpretation:
For each unit increase in the explanatory variable, the mean of the response variable increases by \(\beta_1\).
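As a concrete sketch of fitting and interpreting an SLR model in R (the data here are simulated purely for illustration, so the names and numbers are not from the text above):
set.seed(20180907)                       # simulated data for illustration only
d <- data.frame(x = runif(50, 0, 10))
d$y <- 1 + 2 * d$x + rnorm(50, sd = 2)   # true beta0 = 1, beta1 = 2, sigma = 2
m <- lm(y ~ x, data = d)
summary(m)   # t-based hypothesis tests of each beta being 0
confint(m)   # t-based confidence intervals; the slope is the per-unit change in the mean response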
Multiple (linear) regression is the extension of the SLR model to more than one continuous or binary explanatory variable. It also includes quadratic terms as well as interactions. Now, for each observation, you need to collect the value of the response variable as well as all of the explanatory variables, i.e. \((Y_i,X_{i,1},\ldots,X_{i,p})\) for all \(i=1,\ldots,n\). (I’m thinking spreadsheets might be pretty handy about now.)
The model can be written \[Y_i = \beta_0+\beta_1X_{i,1}+\ldots+\beta_p X_{i,p}+\epsilon_i, \qquad \epsilon_i\stackrel{ind}{\sim} N(0,\sigma^2).\] But it is much more succinct when you use some linear algebra, \[Y = X\beta + e, \quad e \sim N(0,\sigma^2\mathrm{I}),\] but now we have to define \(Y\) as the vector of responses, \(X\) as the matrix of explanatory variables, \(\beta\) as the vector of coefficients, and \(e\) as the vector of errors.
Note: the models aren’t quite identical since the matrix version does not have an explicit intercept, but instead you can include an intercept by having the first column of X be all 1s.
Using some linear algebra and assuming we have a full rank \(X\), the maximum likelihood estimator for \(\beta\) is nice and succinct \[\hat\beta_{MLE} = (X^\top X)^{-1}X^\top y.\] How pretty is that!! (Break this out at parties and impress all your friends. Don’t write it down, just say it.)
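To see that this formula matches what lm() computes, here is a small sketch with simulated data (all names hypothetical); the model matrix gets a first column of 1s for the intercept, as noted above.
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                      # first column of 1s provides the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # computes (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ x1 + x2))                       # agrees with the direct calculation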
Multiple regression models are much better at representing reality than simple linear regression models because, generally, the response variable depends on more than one explanatory variable.
With this flexibility, we do lose some interpretability. If there are no interactions, then we can interpret the coefficient for the \(j\)th explanatory variable like this
holding all other variables constant, a one unit increase in the \(j\)th explanatory variable increases the mean of the response variable by \(\beta_j\).
When you read about this in the newspaper or a journal article, they will often use the phrase “after adjusting for …” or “after controlling for …”.
When interactions are present, everything becomes much more complicated. Are you ready to use figures yet?
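As a hint of why, here is a hedged sketch (simulated, hypothetical data) of a model with an interaction in R; the formula x1 * x2 expands to x1 + x2 + x1:x2, and the effect of x1 on the mean response now depends on the value of x2.
set.seed(2)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1 * d$x2 + rnorm(50)
m_int <- lm(y ~ x1 * x2, data = d)
coef(m_int)   # a one unit increase in x1 changes the mean by coef(x1) + coef(x1:x2) * x2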
In this section, we will visualize a bunch of regression models using the ggplot2 R package. First, we need to load the package
library("ggplot2") # you may need to install it first using install.packages("ggplot2")
Also, the data we will use is in the Sleuth3 R package, so it also needs to be loaded.
library("Sleuth3") # you may need to install it first using install.packages("Sleuth3")
We will start with the simple linear regression (SLR) model, but using logarithms we will show that there is a lot of flexibility even in this relatively trivial model. The “simple” in SLR indicates that there is a single explanatory variable and, typically, it is continuous.
ggplot(Sleuth3::case0801,
aes(x = Area, y = Species)) +
geom_point() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)
ggplot(Sleuth3::case0801,
aes(x = Area, y = Species)) +
geom_point() +
scale_x_log10() +
labs(title = "Number of reptile and amphibian species for West Indies islands") +
geom_smooth(method = "lm", formula = y ~ x)