To follow along, use the lab07 code. Also, make sure to load the tidyverse package.
if (!require("tidyverse")) {
install.packages("tidyverse")
library("tidyverse")
}
Recall that a regression model has the following 4 main assumptions: the relationship between the expected response and the explanatory variable is linear, the errors are independent, the errors are normally distributed, and the errors have constant variance.
Throughout this lab, we will investigate these assumptions by simulating data that violates each assumption and determining how the diagnostics (including plots) express this violation.
For fully reproducible results, we will set the seed.
set.seed(20170320)
n <- 100
x <- rnorm(n)
b0 <- 1
b1 <- 2
sigma <- 0.4
error <- rnorm(n,0,sigma)
y <- b0+b1*x+error
d <- data.frame(x=x,y=y)
We can visualize the data
plot(y ~ x, data = d)
or using ggplot
library("ggplot2")
ggplot(d, aes(x=x, y=y)) +
geom_point() +
theme_bw()
m <- lm(y ~ x, data = d)
m
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Coefficients:
## (Intercept) x
## 1.037 2.011
summary(m)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98181 -0.28723 0.01533 0.28822 0.88930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03697 0.03714 27.92 <2e-16 ***
## x 2.01063 0.03706 54.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.371 on 98 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9674
## F-statistic: 2943 on 1 and 98 DF, p-value: < 2.2e-16
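Rather than reading numbers off the printed summary, we can extract the same quantities programmatically from the fitted model object `m`; a quick sketch:

```r
coef(m)               # estimated intercept and slope
confint(m)            # 95% confidence intervals for the coefficients
sigma(m)              # residual standard error, the estimate of the error sd
summary(m)$r.squared  # multiple R-squared
```

These extractor functions are handy when the estimates feed into later computations.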
Plot the fitted regression line
plot(y ~ x, data = d)
abline(m, col='blue')
or using ggplot
ggplot(d, aes(x=x, y=y)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE) +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
opar = par(mfrow=c(2,3)) # Create a 2x3 grid of plots and save the original settings
plot(m, 1:6, ask=FALSE) # Plot all 6 diagnostics plots
par(opar) # Return to original graphics settings
This is what the diagnostic plots look like when all model assumptions are satisfied.
Simulate data with n = 234 observations, a uniform(0,1) explanatory variable, intercept -0.5, slope 25, and error variance 9.
Then run the regression diagnostic plots.
n <- 234
x <- runif(n)
y <- -0.5 + 25*x + rnorm(n,0,sqrt(9))
m <- lm(y~x)
opar = par(mfrow=c(2,3)) # Create a 2x3 grid of plots and save the original settings
plot(m, 1:6, ask=FALSE) # Plot all 6 diagnostics plots
par(opar) # Return to original graphics settings
Again, these plots satisfy all model assumptions.
There are many ways for the errors to not have a normal distribution. We will consider errors that are heavy-tailed, right-skewed, and left-skewed.
n <- 100
errors <- data.frame(rep = 1:n,
normal = rnorm(n),
heavy_tailed = rt(n, df=3),
right_skewed = exp(rnorm(n))) %>%
mutate(right_skewed = right_skewed - exp(1/2), # make sure errors have expectation of 0
left_skewed = 0 - right_skewed)
errors_long <- errors %>%
tidyr::gather(distribution,value,-rep) %>%
mutate(distribution = factor(distribution,
levels = c("normal",
"heavy_tailed",
"right_skewed",
"left_skewed")))
Before considering the regression, let’s take a look at these errors
ggplot(errors_long, aes(x=value)) +
geom_histogram() +
facet_grid(. ~ distribution) +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(errors_long, aes(sample=value)) +
geom_qq() +
geom_qq_line() +
facet_grid(. ~ distribution) +
theme_bw()
In base R, we can use the qqnorm() and qqline() functions to manually create the QQ-plot.
opar = par(mfrow=c(1,4))
qqnorm(errors$normal, main="Normal")
qqline(errors$normal)
qqnorm(errors$heavy_tailed, main="Heavy-tailed")
qqline(errors$heavy_tailed)
qqnorm(errors$right_skewed, main="Right-skewed")
qqline(errors$right_skewed)
qqnorm(errors$left_skewed, main="Left-skewed")
qqline(errors$left_skewed)
par(opar)
Of course, we don’t see these errors directly. Instead, we obtain the residuals which are estimates of the errors.
For simplicity, we will assume the true regression model has intercept 0 and slope 0. Thus the response is really just the errors themselves, but we will fit the simple linear regression model.
y <- errors %>%
mutate(x = rnorm(n))
models <- list()
models[[1]] <- lm(normal ~ x, data=y)
models[[2]] <- lm(heavy_tailed ~ x, data=y)
models[[3]] <- lm(right_skewed ~ x, data=y)
models[[4]] <- lm(left_skewed ~ x, data=y)
names(models) <- c("normal","heavy_tailed","right_skewed","left_skewed")
opar <- par(mfrow=c(2,3))
plot(models$normal, 1:6, ask=FALSE)
par(opar)
opar <- par(mfrow=c(2,3))
plot(models$heavy_tailed, 1:6, ask=FALSE)
par(opar)
opar <- par(mfrow=c(2,3))
plot(models$right_skewed, 1:6, ask=FALSE)
par(opar)
opar <- par(mfrow=c(2,3))
plot(models$left_skewed, 1:6, ask=FALSE)
par(opar)
Focusing on the QQ-plots:
opar = par(mfrow=c(1,4))
for (i in 1:4) plot(models[[i]], 2, main=names(models)[i])
par(opar)
Using the same errors, simulate data with an intercept of 10 and a slope of -5. Run the regression analysis and view the QQ-plots.
b0 <- 10
b1 <- -5
y_new <- y %>%
mutate(y_normal = b0+b1*x + normal,
y_heavy_tailed = b0+b1*x + heavy_tailed,
y_right_skewed = b0+b1*x + right_skewed,
y_left_skewed = b0+b1*x + left_skewed)
models <- list()
models[[1]] <- lm(y_normal ~ x, data=y_new)
models[[2]] <- lm(y_heavy_tailed ~ x, data=y_new)
models[[3]] <- lm(y_right_skewed ~ x, data=y_new)
models[[4]] <- lm(y_left_skewed ~ x, data=y_new)
names(models) <- c("normal","heavy_tailed","right_skewed","left_skewed")
opar = par(mfrow=c(1,4))
for (i in 1:4) plot(models[[i]], 2, main=names(models)[i])
par(opar)
If you have extra time, try changing the parameters of the error distributions, e.g. a t distribution with smaller or larger degrees of freedom (and hence heavier or lighter tails).
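As one sketch of that extra exercise (the degrees of freedom values here are arbitrary choices): smaller degrees of freedom give heavier tails, which show up as stronger departures in the tails of the QQ-plot.

```r
set.seed(20170320)
n <- 100
x <- rnorm(n)
opar <- par(mfrow = c(1, 2))
for (df in c(2, 30)) {                  # df = 2: very heavy tails; df = 30: nearly normal
  y_t <- 1 + 2*x + rt(n, df = df)
  plot(lm(y_t ~ x), 2, main = paste("t errors, df =", df))
}
par(opar)
```

The df = 30 panel should be hard to distinguish from the normal case, while df = 2 shows extreme points well off the reference line.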
In this section, we will assume the errors are normally distributed, but look at what happens when the variance is not constant. The most common scenario is for the variance to increase as the expected response increases. If the slope is positive (negative), this means the error variance increases as the explanatory variable increases (decreases).
n <- 100
b0 <- 0
b1 <- 1
d <- data.frame(x = runif(n)) %>%
mutate(error = rnorm(n,0,x),
y = b0+b1*x+error)
m <- lm(y~x, data=d)
opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)
par(opar)
A (sideways) funnel pattern is observed in the residuals vs fitted values plot.
plot(m, 1)
A half funnel pattern is observed in the absolute residuals vs fitted values plot and the (red) smoother line indicates an increasing pattern.
plot(m,3)
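The plots can be supplemented with a formal check. A sketch using the Breusch-Pagan test from the lmtest package (not used elsewhere in this lab, so treat it as optional): it tests whether the squared residuals depend on the explanatory variables, and a small p-value indicates non-constant variance.

```r
# Optional: requires the lmtest package; m is the heteroskedastic model fit above
if (require("lmtest")) {
  bptest(m)   # small p-value suggests the error variance is not constant
}
```

In practice the plots are usually more informative than the test, since they show the shape of the violation.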
Simulate data that has a negative slope and an error variance that increases with the fitted value.
b1 <- -1
d <- d %>%
mutate(error = rnorm(n,0,1-x), # sd is now related to 1-x
y = b0+b1*x+error)
m <- lm(y~x, data=d)
opar <- par(mfrow=c(1,2))
plot(m, c(1,3), ask=FALSE)
par(opar)
There are many ways to violate the assumption of independence of the errors in linear regression. The most common is serial correlation, i.e. the errors are related due to the times at which the samples were obtained. If you know when the samples were obtained, you should plot the residuals vs that time. Often the dataset does not have this information, but typically the observations are recorded sequentially, so we can use the row number as a proxy for the observation time. Thus, at a minimum, I suggest you plot residuals vs row number.
Typically, you will have to create the row number (or index) variable.
n <- 100
b0 <-1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
mutate(
error =rnorm(n),
y = b0+b1*x + error)
m <- lm(y ~ x, data = d)
The m object has several components; take a look using the names() function.
names(m)
## [1] "coefficients" "residuals" "effects" "rank" "fitted.values"
## [6] "assign" "qr" "df.residual" "xlevels" "call"
## [11] "terms" "model"
The residuals component provides the residuals in the same order as the observations in the data set, so plot these against their index.
plot(m$residuals)
Alternatively, with ggplot and an easy smoothing line:
dd <- data.frame(index = 1:length(m$residuals),
residuals = m$residuals)
ggplot(dd, aes(index, residuals)) +
geom_point() +
stat_smooth(se = FALSE) +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
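The errors above are independent, so the smoother stays flat. As a sketch of what serial correlation looks like, we can generate AR(1) errors with arima.sim() (from base R's stats package) and plot the residuals against the row index; the 0.8 autocorrelation is an arbitrary illustrative choice.

```r
set.seed(20170320)
n_obs <- 100
# AR(1) errors: each error is correlated with the previous one
e_ar  <- as.numeric(arima.sim(model = list(ar = 0.8), n = n_obs))
d_ar  <- data.frame(x = rnorm(n_obs)) %>%
  mutate(y = 1 + 2*x + e_ar)
m_ar  <- lm(y ~ x, data = d_ar)
plot(m_ar$residuals)   # successive residuals drift together rather than scattering randomly
```

The residuals-vs-index plot now shows long runs above and below zero, which is the signature of positive serial correlation.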
The linearity assumption assumes a linear relationship between the expected response and the explanatory variable. Thus, the linearity assumption is best looked at by plotting the residuals vs the explanatory variable and looking for a pattern.
If the relationship is truly linear, this plot will not show any patterns.
n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
mutate(y = b0+b1*x + rnorm(n))
m <- lm(y ~ x, data = d)
d$residuals <- m$residuals
ggplot(d, aes(x, residuals)) +
geom_point() +
stat_smooth(se = FALSE) +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
mutate(y = b0+b1*x+x^2 + rnorm(n))
m <- lm(y ~ x, data = d)
d$residuals <- m$residuals
ggplot(d, aes(x, residuals)) +
geom_point() +
stat_smooth(se = FALSE) +
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This will also show up in the two residuals vs fitted values plots.
opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)
par(opar)
But when the models get more complex, this relationship might not be clear from these plots. Therefore, I suggest you plot residuals versus each explanatory variable.
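A sketch with two (hypothetical) explanatory variables: x1 enters quadratically while x2 is linear, and the curvature is much easier to see in the residuals-vs-x1 plot than in the residuals-vs-fitted plot.

```r
set.seed(20170320)
n  <- 100
d2 <- data.frame(x1 = rnorm(n), x2 = rnorm(n)) %>%
  mutate(y = 1 + 2*x1 + x1^2 - x2 + rnorm(n))
m2 <- lm(y ~ x1 + x2, data = d2)
d2$residuals <- m2$residuals
opar <- par(mfrow = c(1, 3))
plot(m2, 1, ask = FALSE)          # residuals vs fitted: pattern is muddled by x2
plot(residuals ~ x1, data = d2)   # clear U-shape from the x1^2 term
plot(residuals ~ x2, data = d2)   # no pattern
par(opar)
```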
Simulate explanatory variables from a normal distribution with mean 1 and standard deviation 0.1. Then simulate data from the same quadratic model as above (intercept of 1, slope of 2, a quadratic term, and error standard deviation of 1). Plot residuals vs the explanatory variable. What pattern do you see?
n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n, 1, 0.1)) %>%
mutate(y = b0+b1*x+x^2 + rnorm(n))
m <- lm(y ~ x, data = d)
d$residuals <- m$residuals
plot(residuals ~ x, data = d)
No clear pattern emerges: because the explanatory variable is concentrated in a narrow range, even non-linear relationships can be approximately linear over that range.
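A sketch to see this: refit the same quadratic model with the explanatory variable spread over a wider range (standard deviation 1 instead of 0.1), and the curvature that was invisible before becomes obvious.

```r
set.seed(20170320)
n <- 100
d_wide <- data.frame(x = rnorm(n, 1, 1)) %>%   # wider spread than rnorm(n, 1, 0.1)
  mutate(y = 1 + 2*x + x^2 + rnorm(n))
m_wide <- lm(y ~ x, data = d_wide)
d_wide$residuals <- m_wide$residuals
plot(residuals ~ x, data = d_wide)             # the U-shape is now clearly visible
```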
In this lab, we considered the 4 main regression assumptions and how violations of these assumptions are expressed in regression diagnostics. Of course, there is no reason that only a single assumption is violated at a time. Thus, you may want to simulate data with a variety of violations and see how those combinations of violations express themselves in the diagnostics. In reality, all the assumptions are violated all the time, and we are trying to identify (via the diagnostics) when the assumptions are violated so badly that the model is ineffective for its intended purpose.
The ggResidpanel package provides convenient panels of these diagnostic plots.
# install.packages("ggResidpanel")
library("ggResidpanel")
resid_panel(m, plots="R")
# resid_panel(m, plots="SAS")
resid_xpanel(m)