To follow along, use the lab07 code. Also, make sure to load the tidyverse package.

if (!require("tidyverse")) {
  install.packages("tidyverse")
  library("tidyverse")
}

Overview

Recall that the simple linear regression model, y = b0 + b1*x + error, where the errors are independent normal random variables with mean 0 and constant variance sigma^2, has the following 4 main assumptions:

  • the errors are normally distributed,
  • the errors have constant variance,
  • the errors are independent, and
  • the expected response is a linear function of the explanatory variable (linearity).

Throughout this lab, we will investigate these assumptions by simulating data that violates each assumption in turn and determining how the diagnostics (including plots) express that violation.

Simulation

Throughout this lab, we will be

  1. simulating data,
  2. running a regression with those data, and
  3. looking at the diagnostics.

Simulate data

For fully reproducible results, we will set the seed.

set.seed(20170320)
n <- 100 
x <- rnorm(n)
b0 <- 1
b1 <- 2
sigma <- 0.4
error <- rnorm(n,0,sigma)
y <- b0+b1*x+error

d <- data.frame(x=x,y=y)

We can visualize the data

plot(y ~ x, data = d)

or using ggplot

library("ggplot2")
ggplot(d, aes(x=x, y=y)) +
  geom_point() +
  theme_bw()

Run the regression

m <- lm(y ~ x, data = d)
m
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##       1.037        2.011
summary(m)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98181 -0.28723  0.01533  0.28822  0.88930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03697    0.03714   27.92   <2e-16 ***
## x            2.01063    0.03706   54.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.371 on 98 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9674 
## F-statistic:  2943 on 1 and 98 DF,  p-value: < 2.2e-16

Plot the line

plot(y ~ x, data = d)
abline(m, col='blue')

or using ggplot

ggplot(d, aes(x=x, y=y)) +
  geom_point() + 
  stat_smooth(method = "lm", se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Look at diagnostics

opar = par(mfrow=c(2,3))  # Create a 2x3 grid of plots and save the original settings
plot(m, 1:6, ask=FALSE)   # Plot all 6 diagnostic plots

par(opar)                 # Return to original graphics settings

This is what the plots look like when all model assumptions are satisfied.

Activity

Simulate the following data

  • There are 234 total observations.
  • Explanatory variable is a random uniform on (0,1).
  • Intercept is -0.5, slope is 25, variance is 9.

Then run the regression diagnostic plots.

Click for solution
n <- 234
x <- runif(n)
y <- -0.5 + 25*x + rnorm(n,0,sqrt(9))
m <- lm(y~x)
opar = par(mfrow=c(2,3))  # Create a 2x3 grid of plots and save the original settings
plot(m, 1:6, ask=FALSE)   # Plot all 6 diagnostic plots

par(opar)                 # Return to original graphics settings
Again, these plots satisfy all model assumptions.

Normality

There are many ways for the errors to not have a normal distribution. We will consider errors that are heavy-tailed, right-skewed, and left-skewed.

n <- 100
errors <- data.frame(rep = 1:n,
                     normal       = rnorm(n),
                     heavy_tailed = rt(n, df=3),
                     right_skewed = exp(rnorm(n))) %>%
  mutate(right_skewed = right_skewed - exp(1/2),  # make sure errors have expectation of 0
         left_skewed  = 0 - right_skewed) 

errors_long <- errors %>%
  tidyr::gather(distribution,value,-rep) %>%
  mutate(distribution = factor(distribution, 
                               levels = c("normal",
                                          "heavy_tailed",
                                          "right_skewed",
                                          "left_skewed")))

Before considering the regression, let’s take a look at these errors

ggplot(errors_long, aes(x=value)) + 
  geom_histogram() + 
  facet_grid(. ~ distribution) + 
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(errors_long, aes(sample=value)) + 
  geom_qq() + 
  geom_qq_line() + 
  facet_grid(. ~ distribution) + 
  theme_bw()

In base R, we can use the qqnorm() and qqline() functions to create the QQ-plot manually.

opar = par(mfrow=c(1,4))
qqnorm(errors$normal, main="Normal")
qqline(errors$normal)
qqnorm(errors$heavy_tailed, main="Heavy-tailed")
qqline(errors$heavy_tailed)
qqnorm(errors$right_skewed, main="Right-skewed")
qqline(errors$right_skewed)
qqnorm(errors$left_skewed, main="Left-skewed")
qqline(errors$left_skewed)

par(opar)

Regression

Of course, we don’t see these errors directly. Instead, we obtain the residuals, which are estimates of the errors.

For simplicity, we will assume the true regression model has intercept 0 and slope 0. Thus the response is really just the errors themselves, but we will fit the simple linear regression model.

y <- errors %>%
  mutate(x = rnorm(n))

models <- list()
models[[1]] <- lm(normal       ~ x, data=y)
models[[2]] <- lm(heavy_tailed ~ x, data=y)
models[[3]] <- lm(right_skewed ~ x, data=y)
models[[4]] <- lm(left_skewed  ~ x, data=y)
names(models) <- c("normal","heavy_tailed","right_skewed","left_skewed")
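
Since we simulated the errors ourselves, we can check how well the residuals recover them. A quick sketch using the normal errors (any of the error columns would work the same way):

plot(y$normal, residuals(models$normal),
     xlab = "True error", ylab = "Residual")
abline(0, 1)  # residuals should fall close to this line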

Normal

opar <- par(mfrow=c(2,3))
plot(models$normal, 1:6, ask=FALSE)

par(opar)

Heavy_tailed

opar <- par(mfrow=c(2,3))
plot(models$heavy_tailed, 1:6, ask=FALSE)

par(opar)

Right_skewed

opar <- par(mfrow=c(2,3))
plot(models$right_skewed, 1:6, ask=FALSE)

par(opar)

Left_skewed

opar <- par(mfrow=c(2,3))
plot(models$left_skewed, 1:6, ask=FALSE)

par(opar)

QQ-plots

Focusing on the QQ-plots

opar = par(mfrow=c(1,4))
for (i in 1:4) plot(models[[i]], 2, main=names(models)[i])

par(opar)

Activity

Using the same errors, simulate data with an intercept of 10 and a slope of -5. Run the regression analysis and view the QQ-plots.

Click for solution
b0 <-  10
b1 <- -5
y_new <- y %>% 
  mutate(y_normal       = b0+b1*x + normal,
         y_heavy_tailed = b0+b1*x + heavy_tailed,
         y_right_skewed = b0+b1*x + right_skewed,
         y_left_skewed  = b0+b1*x + left_skewed)

models <- list()
models[[1]] <- lm(y_normal       ~ x, data=y_new)
models[[2]] <- lm(y_heavy_tailed ~ x, data=y_new)
models[[3]] <- lm(y_right_skewed ~ x, data=y_new)
models[[4]] <- lm(y_left_skewed  ~ x, data=y_new)
names(models) <- c("normal","heavy_tailed","right_skewed","left_skewed")

opar = par(mfrow=c(1,4))
for (i in 1:4) plot(models[[i]], 2, main=names(models)[i])

par(opar)
If you have extra time, try changing the values in the distributions for the errors, e.g. have a t distribution with a larger or smaller variance.
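
One possibility is a quick sketch comparing t-distributed errors with few vs many degrees of freedom (the df values here are arbitrary choices):

e2  <- rt(n, df=2)   # very heavy tails
e30 <- rt(n, df=30)  # close to normal
opar <- par(mfrow=c(1,2))
qqnorm(e2,  main="t, df = 2");  qqline(e2)
qqnorm(e30, main="t, df = 30"); qqline(e30)

par(opar)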

Constant variance

In this section, we will assume the errors are normally distributed but look at what happens when the variance is not constant. The most common scenario is for the variance to increase as the expected response increases. If the slope is positive (negative), this means that as the explanatory variable increases (decreases), the error variance increases.

Assume positive slope

n  <- 100
b0 <- 0
b1 <- 1
d  <- data.frame(x = runif(n)) %>%
  mutate(error = rnorm(n,0,x),
         y = b0+b1*x+error)
  
m <- lm(y~x, data=d)
opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

Residuals vs fitted values

A (sideways) funnel pattern is observed in the residuals vs fitted values plot.

plot(m, 1)

Absolute residuals vs fitted values

A half funnel pattern is observed in the absolute residuals vs fitted values plot and the (red) smoother line indicates an increasing pattern.

plot(m,3)

Activity

Simulate data that has a negative slope and an error variance that increases with the fitted value.

Click for solution
b1 <- -1
d  <- d %>%
  mutate(error = rnorm(n,0,1-x), # sd is now related to 1-x
         y = b0+b1*x+error)
  
m <- lm(y~x, data=d)
opar <- par(mfrow=c(1,2))
plot(m, c(1,3), ask=FALSE)

par(opar)

Independence

There are many ways to violate the assumption of independence of the errors in linear regression. The most common is serial correlation, i.e. the errors are related through the times at which the samples were obtained. If you know when the samples were obtained, you should plot the residuals vs that time. Often the dataset does not have this information, but typically the observations are recorded sequentially, so we can use the row number as a proxy for the observation time. Thus, at a minimum, I suggest you plot residuals vs row number.

Typically, you will have to create the row number (or index) variable.
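
For example, with dplyr you can add a row-number index to any data frame (a sketch; example_data and its contents are made up):

example_data <- data.frame(residuals = rnorm(10)) %>%
  mutate(index = row_number())
head(example_data)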

Independent errors

n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
  mutate(
    error = rnorm(n),
    y = b0+b1*x + error)

m <- lm(y ~ x, data = d)

The m object has several components; take a look at them using the names() function.

names(m)
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
##  [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
## [11] "terms"         "model"

The residuals component provides the residuals in the same order as the observations in the data set, so we can plot these.

plot(m$residuals)

Alternatively with ggplot and an easy smoothing line

dd <- data.frame(index = 1:length(m$residuals),
                 residuals = m$residuals)
ggplot(dd, aes(index, residuals)) +
  geom_point() + 
  stat_smooth(se = FALSE) +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Serially correlated errors

d <- d %>% 
  mutate(error = sin(2*pi*(1:n)/n)+rnorm(n),
         y = b0+b1*x+error)

m <- lm(y ~ x, data = d)

dd <- data.frame(index = 1:length(m$residuals),
                 residuals = m$residuals)
ggplot(dd, aes(index, residuals)) +
  geom_point() + 
  stat_smooth(se = FALSE) +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It is certainly not obvious that the errors have a sine-wave structure, but it is pretty clear something is wrong since the residuals for early observations are positive while those for later observations are negative.
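
Because we simulated the errors, we can overlay the true sine-wave trend on the residual plot (a sketch; in a real analysis this curve would be unknown):

ggplot(dd, aes(index, residuals)) +
  geom_point() +
  geom_line(aes(y = sin(2*pi*index/n)), colour = "red") +  # true error trend
  theme_bw()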

Nothing obvious shows up in the default diagnostic plots.

opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

Activity

Simulate errors that have a positive trend, fit the regression, and plot the residuals vs index.

Click for solution
d <- d %>% 
  mutate(error = (1:n)+rnorm(n), # This is a bit extreme
         y = b0+b1*x+error)

m <- lm(y ~ x, data = d)
plot(m$residuals)

Linearity

The linearity assumption states that the expected response is a linear function of the explanatory variable. It is therefore best assessed by plotting the residuals vs the explanatory variable and looking for a pattern.

Linear

If the relationship is truly linear, this plot will not show any patterns.

n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
  mutate(y = b0+b1*x + rnorm(n))
                
m <- lm(y ~ x, data = d)

d$residuals <- m$residuals

ggplot(d, aes(x, residuals)) + 
  geom_point() + 
  stat_smooth(se = FALSE) + 
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Non-linear relationship

n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
  mutate(y = b0+b1*x+x^2 + rnorm(n))
                
m <- lm(y ~ x, data = d)

d$residuals <- m$residuals

ggplot(d, aes(x, residuals)) + 
  geom_point() + 
  stat_smooth(se = FALSE) + 
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This will also show up in the two residuals vs fitted values plots.

opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

But when the models get more complex, this relationship might not be clear from these plots. Therefore, I suggest you plot residuals versus each explanatory variable.
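
For instance, here is a sketch with two explanatory variables (the variables x1 and x2 and the quadratic term in x2 are made up for illustration); in a real analysis you would repeat the residual plot for every explanatory variable in the model:

n  <- 100
d2 <- data.frame(x1 = rnorm(n), x2 = runif(n)) %>%
  mutate(y = 1 + 2*x1 - 3*x2^2 + rnorm(n))  # non-linear in x2 only

m2 <- lm(y ~ x1 + x2, data = d2)
d2$residuals <- m2$residuals

opar <- par(mfrow=c(1,2))
plot(residuals ~ x1, data = d2)  # should look patternless
plot(residuals ~ x2, data = d2)  # curvature reveals the missed non-linearity
par(opar)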

Activity

Simulate explanatory variables from a normal distribution with mean 1 and standard deviation 0.1. Then simulate data using the same non-linear (quadratic) relationship as above, with an intercept of 1, a slope of 2, and an error standard deviation of 1. Plot residuals vs the explanatory variable. What pattern do you see?

Click for solution
n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n, 1, 0.1)) %>%
  mutate(y = b0+b1*x+x^2 + rnorm(n))
                
m <- lm(y ~ x, data = d)

d$residuals <- m$residuals
plot(residuals ~ x, data = d)
Even non-linear relationships can be approximately linear over a narrow range of the explanatory variable.
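
To see why, a quick sketch plots the true quadratic mean function over the narrow range covered by this explanatory variable:

curve(1 + 2*x + x^2, from = 0.7, to = 1.3)  # nearly straight over this narrow range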

Summary

In this lab, we considered the 4 main regression assumptions and how violations of these assumptions are expressed in the regression diagnostics. Of course, there is no reason that only a single assumption is violated at a time, so you may want to simulate data with a variety of violations and see how those combinations of violations express themselves in the diagnostics. In reality, all the assumptions are violated all the time, and we are trying to identify (via the diagnostics) when the assumptions are violated so badly that the model is ineffective for its intended purpose.
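
For example, a sketch combining two violations, heavy-tailed errors whose spread also grows with the explanatory variable (the df and scaling choices are arbitrary):

n <- 100
d_combo <- data.frame(x = runif(n)) %>%
  mutate(error = x * rt(n, df=3),  # spread grows with x; heavy tails from the t
         y = 1 + 2*x + error)

m_combo <- lm(y ~ x, data = d_combo)
opar <- par(mfrow=c(2,3))
plot(m_combo, 1:6, ask=FALSE)

par(opar)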

An alternative version of these diagnostic plots is available via the ggResidpanel package.

# install.packages("ggResidpanel")
library("ggResidpanel")

resid_panel(m, plots="R")

# resid_panel(m, plots="SAS")
resid_xpanel(m)