library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.2
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
theme_set(theme_bw())
library("MASS")
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library("MuMIn")
## Warning: package 'MuMIn' was built under R version 4.2.3
library("glmnet")
## Warning: package 'glmnet' was built under R version 4.2.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library("knitr")
library("kableExtra")
## Warning: package 'kableExtra' was built under R version 4.2.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
These slides introduce the notions of variable selection using information criteria and regularized regression techniques. Both approaches add a penalty to the methodology and thus fall under the umbrella of penalized regression techniques.
For simplicity in the analyses, I will use subsets of the diamonds data set: we eliminate the categorical variables (for simplicity) and randomly select 100 observations each for a training set and a testing set.
set.seed(20230425)
n <- 100
d <- diamonds %>%
dplyr::select(-cut, -color, -clarity) %>%
rename(lprice = price) %>%
mutate(lprice = log(lprice))
train <- d %>% sample_n(n)
test <- d %>% sample_n(n)
Let’s fit a few regression models to compare with our penalized approaches.
Intercept only
m <- lm(lprice ~ 1, data = train)
p <- predict(m, newdata = test)
error <- data.frame(
group = "Regression",
method = "Intercept-only",
in_sample = mean(m$residuals^2),
out_of_sample = mean((p - test$lprice)^2)
)
All main effects of the continuous variables.
m <- lm(lprice ~ ., data = train)
p <- predict(m, newdata = test)
error <- bind_rows(
error,
data.frame(
group = "Regression",
method = "Main effects",
in_sample = mean(m$residuals^2),
out_of_sample = mean((p - test$lprice)^2)
)
)
Now includes all interactions
m <- lm(lprice ~ .^2, data = train)
p <- predict(m, newdata = test)
error <- bind_rows(
error,
data.frame(
group = "Regression",
method = "Interactions",
in_sample = mean(m$residuals^2),
out_of_sample = mean((p - test$lprice)^2)
)
)
In regression models with many explanatory variables, there is a question of which explanatory variables should be included in a model. If, ultimately, we choose a single set, this is a model or variable selection problem.
We already know we can formally test two models that are nested.
m1 <- lm(lprice ~ ., data = train)
m2 <- lm(lprice ~ .^2, data = train)
anova(m1, m2)
## Analysis of Variance Table
##
## Model 1: lprice ~ carat + depth + table + x + y + z
## Model 2: lprice ~ (carat + depth + table + x + y + z)^2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 93 3.8094
## 2 78 3.0168 15 0.79257 1.3661 0.1855
We can utilize this approach to search through models sequentially to determine which explanatory variables (and interactions) should be included in the final model.
R has a step() function that will automatically perform this process, although rather than utilizing the F-test and associated p-value it uses Akaike’s Information Criterion (AIC).
The general AIC formula is -2 times the log likelihood plus two times the number of \(\beta\)s (call this \(p\)). For linear regression the formula is (up to an additive constant) \[
\mathrm{AIC} = n\log(\hat\sigma^2) + 2p.
\] The \(2p\) is a penalty that attempts to reduce the number of parameters and, therefore, explanatory variables in the model. We can obtain AIC from a given model using the extractAIC() function:
extractAIC(m1)
## [1] 7.0000 -312.7697
extractAIC(m2)
## [1] 22.0000 -306.0963
The first value is \(p\) while the second value is AIC. Since we are trying to minimize \(\hat\sigma^2\) and the number of parameters, we are looking for models whose AIC is small. Thus, in this comparison, AIC prefers model m1 due to its lower AIC.
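As a sanity check on this formula, we can reproduce extractAIC()'s value by hand; this is a sketch using the built-in cars data set, relying on the fact that extractAIC() for lm objects computes \(n\log(RSS/n) + kp\) with \(k = 2\) by default (use \(k = \log(n)\) for BIC).

```r
# Reproduce extractAIC() by hand for a simple linear model.
m   <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
p   <- length(coef(m))             # number of beta coefficients
rss <- sum(residuals(m)^2)
aic_by_hand <- n * log(rss / n) + 2 * p
all.equal(aic_by_hand, extractAIC(m)[2])   # TRUE
```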
An alternative to AIC is BIC which replaces the penalty by \(\log(n)\times p\). Since \(\log(n)>2\) for \(n > 7\), BIC generally suggests smaller models than AIC.
extractAIC(m1, k = log(n))
## [1] 7.0000 -294.5335
extractAIC(m2, k = log(n))
## [1] 22.0000 -248.7825
BIC also prefers m1 since, again, it has the smaller BIC.
One approach is to start with the most basic model: one that only includes the intercept. Then try adding variables that improve (decrease) AIC.
m_forward <- step(lm(lprice ~ 1, data = train),
scope = formula(lm(lprice ~ .^2, data = train)),
direction = "forward"
)
## Start: AIC=2.56
## lprice ~ 1
##
## Df Sum of Sq RSS AIC
## + x 1 95.846 4.714 -301.472
## + y 1 95.818 4.741 -300.882
## + z 1 95.114 5.446 -287.031
## + carat 1 89.333 11.226 -214.690
## + table 1 3.967 96.593 0.533
## <none> 100.560 2.558
## + depth 1 0.001 100.558 4.557
##
## Step: AIC=-301.47
## lprice ~ x
##
## Df Sum of Sq RSS AIC
## + carat 1 0.67252 4.0411 -314.87
## <none> 4.7136 -301.47
## + y 1 0.02839 4.6852 -300.08
## + z 1 0.00474 4.7088 -299.57
## + table 1 0.00068 4.7129 -299.49
## + depth 1 0.00036 4.7132 -299.48
##
## Step: AIC=-314.87
## lprice ~ x + carat
##
## Df Sum of Sq RSS AIC
## + carat:x 1 0.163321 3.8778 -316.99
## + z 1 0.115022 3.9261 -315.75
## <none> 4.0411 -314.87
## + y 1 0.065853 3.9752 -314.51
## + depth 1 0.054045 3.9870 -314.21
## + table 1 0.016588 4.0245 -313.28
##
## Step: AIC=-316.99
## lprice ~ x + carat + x:carat
##
## Df Sum of Sq RSS AIC
## <none> 3.8778 -316.99
## + y 1 0.0230570 3.8547 -315.59
## + table 1 0.0131537 3.8646 -315.33
## + z 1 0.0037831 3.8740 -315.09
## + depth 1 0.0023037 3.8754 -315.05
Final model
summary(m_forward)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5168764 0.6401572 3.931654 0.0001592841
## x 0.8404663 0.2118062 3.968092 0.0001397501
## carat 2.1539661 1.5317040 1.406255 0.1628771486
## x:carat -0.2547876 0.1267100 -2.010792 0.0471520524
Training and testing error
p <- predict(m_forward, newdata = test)
error <- bind_rows(
error,
data.frame(
group = "Selection",
method = "Forward",
in_sample = mean(m_forward$residuals^2),
out_of_sample = mean((p - test$lprice)^2)
)
)
Another approach is to start with the largest model and eliminate variables that (when eliminated) decrease AIC.
m_backward <- step(lm(lprice ~ .^2, data = train),
direction = "backward"
)
## Start: AIC=-306.1
## lprice ~ (carat + depth + table + x + y + z)^2
##
## Df Sum of Sq RSS AIC
## - carat:x 1 0.000996 3.0178 -308.06
## - depth:y 1 0.002573 3.0194 -308.01
## - carat:depth 1 0.006399 3.0232 -307.88
## - x:y 1 0.008296 3.0251 -307.82
## - depth:z 1 0.009201 3.0260 -307.79
## - y:z 1 0.010314 3.0271 -307.75
## - carat:z 1 0.012369 3.0292 -307.69
## - depth:x 1 0.018539 3.0354 -307.48
## - carat:table 1 0.034585 3.0514 -306.96
## - carat:y 1 0.058998 3.0758 -306.16
## <none> 3.0168 -306.10
## - x:z 1 0.069490 3.0863 -305.82
## - table:y 1 0.107191 3.1240 -304.61
## - table:x 1 0.179468 3.1963 -302.32
## - table:z 1 0.179820 3.1967 -302.31
## - depth:table 1 0.225851 3.2427 -300.88
##
## Step: AIC=-308.06
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table +
## carat:y + carat:z + depth:table + depth:x + depth:y + depth:z +
## table:x + table:y + table:z + x:y + x:z + y:z
##
## Df Sum of Sq RSS AIC
## - depth:y 1 0.001585 3.0194 -310.01
## - depth:z 1 0.013114 3.0309 -309.63
## - y:z 1 0.017423 3.0353 -309.49
## - x:y 1 0.018038 3.0359 -309.47
## - depth:x 1 0.022474 3.0403 -309.32
## - carat:depth 1 0.028940 3.0468 -309.11
## - carat:table 1 0.033982 3.0518 -308.94
## <none> 3.0178 -308.06
## - x:z 1 0.070127 3.0880 -307.77
## - carat:z 1 0.070772 3.0886 -307.75
## - carat:y 1 0.072388 3.0902 -307.69
## - table:y 1 0.106413 3.1242 -306.60
## - table:z 1 0.183291 3.2011 -304.17
## - table:x 1 0.184756 3.2026 -304.12
## - depth:table 1 0.230872 3.2487 -302.69
##
## Step: AIC=-310.01
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table +
## carat:y + carat:z + depth:table + depth:x + depth:z + table:x +
## table:y + table:z + x:y + x:z + y:z
##
## Df Sum of Sq RSS AIC
## - depth:z 1 0.015162 3.0346 -311.51
## - y:z 1 0.017085 3.0365 -311.45
## - x:y 1 0.019696 3.0391 -311.36
## - depth:x 1 0.030052 3.0495 -311.02
## - carat:depth 1 0.031525 3.0509 -310.97
## - carat:table 1 0.039606 3.0590 -310.71
## <none> 3.0194 -310.01
## - x:z 1 0.075054 3.0945 -309.56
## - carat:z 1 0.076518 3.0959 -309.51
## - carat:y 1 0.078231 3.0976 -309.45
## - table:y 1 0.107596 3.1270 -308.51
## - table:z 1 0.183557 3.2030 -306.11
## - table:x 1 0.193039 3.2125 -305.81
## - depth:table 1 0.230651 3.2501 -304.65
##
## Step: AIC=-311.51
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table +
## carat:y + carat:z + depth:table + depth:x + table:x + table:y +
## table:z + x:y + x:z + y:z
##
## Df Sum of Sq RSS AIC
## - y:z 1 0.002620 3.0372 -313.42
## - carat:depth 1 0.016612 3.0512 -312.96
## <none> 3.0346 -311.51
## - x:y 1 0.061645 3.0962 -311.50
## - carat:z 1 0.063135 3.0977 -311.45
## - depth:x 1 0.064278 3.0989 -311.41
## - carat:y 1 0.064680 3.0993 -311.40
## - carat:table 1 0.069263 3.1038 -311.25
## - x:z 1 0.094890 3.1295 -310.43
## - table:y 1 0.099180 3.1338 -310.29
## - table:z 1 0.168595 3.2032 -308.10
## - table:x 1 0.178070 3.2126 -307.81
## - depth:table 1 0.217293 3.2519 -306.59
##
## Step: AIC=-313.42
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table +
## carat:y + carat:z + depth:table + depth:x + table:x + table:y +
## table:z + x:y + x:z
##
## Df Sum of Sq RSS AIC
## - carat:depth 1 0.014521 3.0517 -314.95
## <none> 3.0372 -313.42
## - carat:z 1 0.061895 3.0991 -313.41
## - carat:y 1 0.063297 3.1005 -313.36
## - depth:x 1 0.064492 3.1017 -313.32
## - carat:table 1 0.084911 3.1221 -312.67
## - x:y 1 0.110684 3.1479 -311.84
## - x:z 1 0.113229 3.1504 -311.76
## - table:y 1 0.122732 3.1599 -311.46
## - table:x 1 0.188291 3.2255 -309.41
## - table:z 1 0.191552 3.2287 -309.31
## - depth:table 1 0.254134 3.2913 -307.39
##
## Step: AIC=-314.95
## lprice ~ carat + depth + table + x + y + z + carat:table + carat:y +
## carat:z + depth:table + depth:x + table:x + table:y + table:z +
## x:y + x:z
##
## Df Sum of Sq RSS AIC
## <none> 3.0517 -314.95
## - carat:table 1 0.08217 3.1339 -314.29
## - carat:y 1 0.12853 3.1802 -312.82
## - carat:z 1 0.12892 3.1806 -312.81
## - depth:x 1 0.16434 3.2161 -311.70
## - table:y 1 0.18633 3.2380 -311.02
## - x:y 1 0.19593 3.2476 -310.72
## - x:z 1 0.20241 3.2541 -310.52
## - table:x 1 0.21190 3.2636 -310.23
## - table:z 1 0.25652 3.3082 -308.88
## - depth:table 1 0.34762 3.3993 -306.16
Final model
summary(m_backward)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 693.9954125 198.98233863 3.487724 0.0007817938
## carat 16.6985110 10.47722252 1.593792 0.1147839305
## depth -10.8918398 3.27313133 -3.327651 0.0013070183
## table -10.6526813 3.28964993 -3.238242 0.0017299373
## x -69.2748537 25.48698778 -2.718048 0.0079928073
## y -65.1820052 22.93834218 -2.841618 0.0056463589
## z 184.1495785 62.99661772 2.923166 0.0044642064
## carat:table -0.2512743 0.16808215 -1.494949 0.1387202199
## carat:y -5.9414008 3.17774681 -1.869690 0.0650529439
## carat:z 9.2349531 4.93182748 1.872522 0.0646556918
## depth:table 0.1670259 0.05432095 3.074796 0.0028512277
## depth:x 0.2678994 0.12671818 2.114136 0.0375046461
## table:x 0.9508779 0.39608467 2.400693 0.0185997380
## table:y 0.8123051 0.36083522 2.251180 0.0270170333
## table:z -2.7059509 1.02445321 -2.641361 0.0098652842
## x:y 4.2572933 1.84422767 2.308442 0.0234620396
## x:z -6.9452919 2.96011494 -2.346291 0.0213454207
Training and testing error
p <- predict(m_backward, newdata = test)
error <- bind_rows(
error,
data.frame(
group = "Selection",
method = "Backward",
in_sample = mean(m_backward$residuals^2),
out_of_sample = mean((p - test$lprice)^2)
)
)
Or we can traverse in both directions, starting from an intermediate model; here we start from the main-effects model.
m_both <- step(lm(lprice ~ ., data = train),
scope = formula(lm(lprice ~ .^2, data = train)),
direction = "both"
)
## Start: AIC=-312.77
## lprice ~ carat + depth + table + x + y + z
##
## Df Sum of Sq RSS AIC
## + depth:table 1 0.15650 3.6529 -314.96
## - table 1 0.00219 3.8116 -314.71
## - x 1 0.00285 3.8123 -314.69
## - y 1 0.00388 3.8133 -314.67
## - depth 1 0.06652 3.8759 -313.04
## <none> 3.8094 -312.77
## - z 1 0.09535 3.9048 -312.30
## + carat:x 1 0.04561 3.7638 -311.97
## + depth:z 1 0.03887 3.7705 -311.80
## + carat:z 1 0.03178 3.7776 -311.61
## + carat:y 1 0.02195 3.7875 -311.35
## + table:z 1 0.01397 3.7954 -311.14
## + depth:y 1 0.01376 3.7956 -311.13
## + depth:x 1 0.01263 3.7968 -311.10
## + table:y 1 0.00599 3.8034 -310.93
## + table:x 1 0.00587 3.8035 -310.92
## + x:z 1 0.00437 3.8050 -310.88
## + carat:depth 1 0.00411 3.8053 -310.88
## + carat:table 1 0.00309 3.8063 -310.85
## + x:y 1 0.00082 3.8086 -310.79
## + y:z 1 0.00031 3.8091 -310.78
## - carat 1 0.86560 4.6750 -294.29
##
## Step: AIC=-314.96
## lprice ~ carat + depth + table + x + y + z + depth:table
##
## Df Sum of Sq RSS AIC
## - x 1 0.00119 3.6541 -316.93
## - y 1 0.00325 3.6562 -316.88
## <none> 3.6529 -314.96
## - z 1 0.08603 3.7389 -314.64
## + carat:x 1 0.03018 3.6227 -313.79
## + x:z 1 0.01586 3.6370 -313.40
## + carat:z 1 0.01486 3.6380 -313.37
## + carat:y 1 0.01221 3.6407 -313.30
## + depth:x 1 0.01205 3.6409 -313.30
## + depth:y 1 0.01047 3.6424 -313.25
## + x:y 1 0.00998 3.6429 -313.24
## + table:y 1 0.00552 3.6474 -313.12
## + table:x 1 0.00500 3.6479 -313.10
## + table:z 1 0.00403 3.6489 -313.08
## + y:z 1 0.00204 3.6509 -313.02
## + depth:z 1 0.00155 3.6514 -313.01
## + carat:depth 1 0.00147 3.6514 -313.01
## + carat:table 1 0.00021 3.6527 -312.97
## - depth:table 1 0.15650 3.8094 -312.77
## - carat 1 0.88162 4.5345 -295.34
##
## Step: AIC=-316.93
## lprice ~ carat + depth + table + y + z + depth:table
##
## Df Sum of Sq RSS AIC
## - y 1 0.00253 3.6566 -318.86
## <none> 3.6541 -316.93
## + carat:y 1 0.00904 3.6451 -315.18
## + carat:z 1 0.00673 3.6474 -315.12
## + table:y 1 0.00408 3.6500 -315.04
## + depth:y 1 0.00320 3.6509 -315.02
## + table:z 1 0.00285 3.6512 -315.01
## + carat:depth 1 0.00197 3.6521 -314.99
## + y:z 1 0.00142 3.6527 -314.97
## + x 1 0.00119 3.6529 -314.96
## + depth:z 1 0.00050 3.6536 -314.95
## + carat:table 1 0.00003 3.6541 -314.93
## - depth:table 1 0.15816 3.8123 -314.69
## - z 1 0.18660 3.8407 -313.95
## - carat 1 0.89011 4.5442 -297.13
##
## Step: AIC=-318.86
## lprice ~ carat + depth + table + z + depth:table
##
## Df Sum of Sq RSS AIC
## <none> 3.6566 -318.86
## + carat:depth 1 0.0043 3.6524 -316.98
## + carat:z 1 0.0033 3.6534 -316.95
## + y 1 0.0025 3.6541 -316.93
## + depth:z 1 0.0023 3.6543 -316.93
## + table:z 1 0.0014 3.6552 -316.90
## + x 1 0.0005 3.6562 -316.88
## + carat:table 1 0.0001 3.6566 -316.86
## - depth:table 1 0.1582 3.8149 -316.63
## - carat 1 0.8887 4.5453 -299.11
## - z 1 7.1337 10.7903 -212.65
Final model
summary(m_both)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.07953828 24.749795370 2.306263 2.329334e-02
## carat -1.09504132 0.229098703 -4.779780 6.467710e-06
## depth -0.91194731 0.404259186 -2.255848 2.639880e-02
## table -0.84991422 0.420502725 -2.021186 4.610443e-02
## z 2.12046543 0.156584802 13.541962 8.202012e-24
## depth:table 0.01386812 0.006875878 2.016924 4.655592e-02
As can be seen from this example, there is no reason that these approaches will lead to the same model.
Training and testing error
p <- predict(m_both, newdata = test)
error <- bind_rows(
error,
data.frame(
group = "Selection",
method = "Forward and backward",
in_sample = mean(m_both$residuals^2),
out_of_sample = mean((p - test$lprice)^2)
)
)
There is no guarantee that using a stepwise approach will lead to the model with the best AIC (or any other criterion). Above we saw that 3 different approaches led to 3 different models.
An alternative to performing a stepwise approach to model selection is to calculate AIC for all models and then choose the model that has the lowest AIC. The number of models we need to consider is 2 raised to the power of the number of explanatory variables (since every explanatory variable can be either in or out of the model). When the number of explanatory variables is small, this is possible. As the number of explanatory variables increases, this quickly becomes computationally infeasible.
data.frame(p = 1:20) %>%
mutate(`Number of models` = 2^p) %>%
ggplot(aes(x = p, y = `Number of models`)) +
geom_line()
When you include interactions, the number of possible two-way interactions is the number of explanatory variables choose 2.
data.frame(p = 1:20) %>%
mutate(`Number of interactions` = choose(p,2)) %>%
ggplot(aes(x = p, y = `Number of interactions`)) +
geom_line()
When we are considering all subsets, we can consider including or excluding these interactions as well. If each two-way interaction can also be either included or excluded, then we have \(2^{p + \binom{p}{2}}\) possible models.
data.frame(p = 1:20) %>%
mutate(`Number of models` = 2^(p + choose(p,2))) %>%
ggplot(aes(x = p, y = `Number of models`)) +
geom_line()
But of course we could consider higher order interactions as well.
The bottom line is that even for a small number of explanatory variables, there are a large number of possible models.
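As a quick illustration of this growth (my own numbers, counting each term as in or out of the model):

```r
# Counting candidate models for p = 10 explanatory variables.
p <- 10
2^p                     # each main effect in or out: 1024 models
2^(p + choose(p, 2))    # each two-way interaction in or out as well: ~3.6e16
```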
Rather than selecting a single model, we can utilize all models and give each model a probability. Then, for each test observation, we can average our predictions across all the models.
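The probabilities used here are Akaike weights; below is a minimal sketch of the computation with made-up AIC values (MuMIn reports these as the `weight` column).

```r
# Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
# where delta_i = AIC_i - min(AIC).
aic   <- c(10, 12, 15)     # hypothetical AIC values for three models
delta <- aic - min(aic)
w     <- exp(-delta / 2) / sum(exp(-delta / 2))
round(w, 3)                # 0.690 0.254 0.057
```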
m_avg <- lm(lprice ~ ., data = train, na.action = na.fail) %>%
MuMIn::dredge()
## Fixed term is "(Intercept)"
Obtain predictions.
mp <- m_avg %>%
get.models(subset = cumsum(weight) <= .99) %>%
model.avg()
p_train <- predict(mp, newdata = train)
p_test <- predict(mp, newdata = test)
error <- bind_rows(
error,
data.frame(
group = "Model averaging",
method = "AIC",
in_sample = mean((p_train - train$lprice)^2),
out_of_sample = mean((p_test - test$lprice)^2)
)
)
Recall that a multiple regression model can be written \[ Y = X\beta + \epsilon, \quad \epsilon \sim N(0,\sigma^2\mathrm{I}_n) \] where \(Y\) is the \(n\)-vector of response values, \(X\) is the \(n \times p\) model matrix, \(\beta\) is the \(p\)-vector of regression coefficients, and \(\epsilon\) is the \(n\)-vector of errors.
The MLE for \(\beta\) is \[ \hat\beta_{MLE} = [X^\top X]^{-1} X^\top y. \]
This MLE is the solution to the formula \[ \hat\beta_{MLE} = \mbox{argmin}_{\beta}\, (Y-X\beta)^\top (Y-X\beta), \] i.e. the sum of squared residuals.
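A quick sketch verifying this closed form against lm(), using the built-in cars data rather than the diamonds subset:

```r
# beta_hat = (X'X)^{-1} X'y reproduces lm()'s coefficient estimates.
X <- model.matrix(dist ~ speed, data = cars)   # includes intercept column
y <- cars$dist
beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # solve the normal equations
coef(lm(dist ~ speed, data = cars))            # same values
```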
In this section, we will be using the glmnet package which takes as input the response vector and the explanatory variable matrix.
train <- list(
y = train$lprice,
x = subset(train, select = -c(lprice)) %>% as.matrix()
)
test <- list(
y = test$lprice,
x = subset(test, select = -c(lprice)) %>% as.matrix()
)
For ridge regression, we add a penalty to the formula above.
Specifically, the ridge regression estimator is the solution to \[
\hat\beta_{ridge}
= \mbox{argmin}_{\beta}\, (Y-X\beta)^\top (Y-X\beta)/2n
+ \lambda \sum_{j} \beta_j^2
\] where the penalty is \(\lambda
\sum_{j} \beta_j^2\) for some value of \(\lambda>0\). Note that this corresponds
to the glmnet
definition and may differ from other sources,
but the idea is the same.
Since we are trying to find the minimum and this penalty sums the square of the \(\beta\)s, the solution to this formula will have \(\beta\) closer to zero. Thus, this method is one of the shrinkage methods.
The solution is available in closed form as \[ \hat\beta_{ridge} = [X^\top X + \lambda\mathrm{I}_p]^{-1} X^\top y. \]
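A sketch of this closed form by hand, using the classic unscaled objective (glmnet standardizes internally and scales by \(2n\), so its numerical estimates will differ):

```r
# Ridge estimate via [X'X + lambda I]^{-1} X'y.
# Center y and standardize X so no (unpenalized) intercept is needed.
X <- scale(model.matrix(dist ~ speed, data = cars)[, -1, drop = FALSE])
y <- cars$dist - mean(cars$dist)
lambda     <- 10
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)
abs(beta_ridge) < abs(beta_ols)    # shrunk toward zero: TRUE
```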
We can fit a ridge regression model using the MASS::lm.ridge() function; here, we will instead use glmnet() with alpha = 0. We continue with the continuous-variable subset of the diamonds data set.
Fit ridge regression
m <- glmnet(
x = train$x,
y = train$y,
alpha = 0 # ridge regression
)
The parameter estimates shrink toward zero as the ridge penalty increases, as can be seen in this plot.
plot(m, xvar = "lambda", label = TRUE)
Note that all coefficient estimates converge to zero (although not monotonically) but are never exactly equal to zero.
Estimation of a ridge regression model involves choosing a value for \(\lambda\).
cv_m <- cv.glmnet(
x = train$x,
y = train$y,
alpha = 0, # ridge regression
# default sequence used values of lambda that are too big
lambda = seq(0, 0.01, length = 100)
)
plot(cv_m)
Choose \(\lambda\) that minimizes cross-validation error in the training set.
best_lambda <- cv_m$lambda.min
m <- glmnet(
x = train$x,
y = train$y,
alpha = 0, # ridge regression
lambda = best_lambda
)
p_train <- predict(m, newx = train$x)
p_test <- predict(m, newx = test$x)
error <- bind_rows(
error,
data.frame(
group = "Regularization",
method = "Ridge",
in_sample = mean((p_train - train$y)^2),
out_of_sample = mean((p_test - test$y)^2)
)
)
When I performed this analysis using MASS::lm.ridge(), my in-sample and out-of-sample errors were almost exactly the same as for the other methods, so I’m not sure what is going on with the glmnet implementation of ridge regression.
There is a Bayesian interpretation of the ridge regression as a normal prior on the coefficients centered at 0. \[ Y \sim N(X\beta, \sigma^2 \mathrm{I}_n), \quad \beta \sim N\left(0, \lambda^{-1}\mathrm{I}_p\right) \]
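To make this connection explicit (a sketch, assuming a known \(\sigma^2\)): the negative log posterior is, up to an additive constant, \[ -\log p(\beta \mid y) = \frac{1}{2\sigma^2}(Y-X\beta)^\top(Y-X\beta) + \frac{\lambda}{2}\beta^\top\beta + \mbox{const}, \] so the posterior mode coincides with a ridge estimate whose effective penalty is proportional to \(\sigma^2\lambda\); the prior precision plays the role of the ridge penalty.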
For least absolute shrinkage and selection operator (LASSO), we add a
penalty to the lm formula above. Specifically, the LASSO
estimate is the solution to \[
\hat\beta_{LASSO}
= \mbox{argmin}_{\beta}\, (Y-X\beta)^\top (Y-X\beta)/2n
+ \lambda \sum_{j} |\beta_j|
\] where the penalty is \(\lambda
\sum_{j} |\beta_j|\) for some value of \(\lambda>0\). Note that this corresponds
to the glmnet
definition and may differ from other sources,
but the idea is the same.
Since we are trying to find the minimum and this penalty sums the absolute values of the \(\beta\)s, the solution to this formula will have \(\beta\) closer to zero. Thus, this method is also one of the shrinkage methods.
The penalty here is very similar to the ridge penalty, but here we use the absolute value rather than the squared coefficient.
The LASSO point estimate cannot be written analytically, but we have reasonable algorithms to obtain the estimate.
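One workhorse algorithm is coordinate descent, whose key building block is the soft-thresholding operator; here is a minimal sketch (the function name is my own):

```r
# Soft-thresholding: the exact solution to the one-dimensional LASSO
# problem, applied coordinate-by-coordinate in algorithms like glmnet's.
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)   # -2  0  0  1
```

Coefficients whose (partial) least-squares update falls inside \([-\lambda, \lambda]\) are set exactly to zero, which is why LASSO point estimates can be exactly zero.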
m <- glmnet(
x = train$x,
y = train$y,
alpha = 1 # LASSO
)
Estimated coefficients as a function of the penalty \(\lambda\).
plot(m, xvar = "lambda", label = TRUE)
Note that some of the point estimates are exactly zero, unlike the ridge analysis. Thus, LASSO can perform model selection. With that being said, this is only for the point estimate and does not incorporate uncertainty on the point estimate.
We still need to decide on a penalty to use. We can choose based on cross-validation.
cv_m <- cv.glmnet(
x = train$x,
y = train$y,
alpha = 1 # LASSO
)
plot(cv_m)
Choose \(\lambda\) that minimizes cross-validation error in the training set.
best_lambda <- cv_m$lambda.min
m <- glmnet(
x = train$x,
y = train$y,
alpha = 1, # LASSO
lambda = best_lambda
)
p_train <- predict(m, newx = train$x)
p_test <- predict(m, newx = test$x)
error <- bind_rows(
error,
data.frame(
group = "Regularization",
method = "LASSO",
in_sample = mean((p_train - train$y)^2),
out_of_sample = mean((p_test - test$y)^2)
)
)
There is a Bayesian interpretation of the LASSO as a Laplace (or double exponential) prior on the coefficients centered at 0.
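Concretely (a sketch, with a known \(\sigma^2\)): if \(\beta_j \stackrel{ind}{\sim} \mathrm{Laplace}(0, b)\) with density \(p(\beta_j) = \frac{1}{2b}e^{-|\beta_j|/b}\), then up to an additive constant \[ -\log p(\beta \mid y) = \frac{1}{2\sigma^2}(Y-X\beta)^\top(Y-X\beta) + \frac{1}{b}\sum_j |\beta_j|, \] so the posterior mode is a LASSO estimate with penalty proportional to \(\sigma^2/b\).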
The elastic net is a combination of the ridge and LASSO approaches.
As parameterized by glmnet(), the penalty is (I think)
\[ \hat\beta_{EN} = \mbox{argmin}_{\beta}\, (Y-X\beta)^\top (Y-X\beta)/2n + \lambda \left[ (1-\alpha) \sum_j \beta_j^2 + \alpha \sum_{j} |\beta_j| \right] \]
Thus, when \(\alpha = 1\) we have LASSO while if \(\alpha=0\) we have ridge regression. Values of \(\alpha\) between 0 and 1 are a mixture between the two.
cv_m <- cv.glmnet(
x = train$x,
y = train$y,
alpha = 0.5 # elastic net
)
Choose \(\lambda\) that minimizes cross-validation error in the training set.
best_lambda <- cv_m$lambda.min
m <- glmnet(
x = train$x,
y = train$y,
alpha = 0.5, # elastic net
lambda = best_lambda
)
p_train <- predict(m, newx = train$x)
p_test <- predict(m, newx = test$x)
error <- bind_rows(
error,
data.frame(
group = "Regularization",
method = "Elastic net (alpha=0.5)",
in_sample = mean((p_train - train$y)^2),
out_of_sample = mean((p_test - test$y)^2)
)
)
In these slides, we introduced a number of different approaches within the context of regression for deciding what explanatory variables to include as well as estimating the coefficients for included explanatory variables. All methods had in common that they penalized bigger, more flexible models in some way. The model selection approaches utilized an information criterion that penalized more explanatory variables. The regularization techniques penalized the magnitude of the coefficients which had the effect of shrinking those coefficients toward 0.
We may want to compare the in-sample and out-of-sample errors across our different methods. We need to be a bit careful here because I allowed more explanatory variables in the model selection and averaging approaches than I did in the regularization approaches.
error %>%
dplyr::select(-group) %>%
kable(digits = 3) %>%
pack_rows(index = table(fct_inorder(error$group))) %>%
kable_styling()
| method | in_sample | out_of_sample |
|---|---|---|
| **Regression** | | |
| Intercept-only | 1.006 | 1.025 |
| Main effects | 0.038 | 0.067 |
| Interactions | 0.030 | 0.071 |
| **Selection** | | |
| Forward | 0.039 | 0.070 |
| Backward | 0.031 | 0.069 |
| Forward and backward | 0.037 | 0.067 |
| **Model averaging** | | |
| AIC | 0.038 | 0.067 |
| **Regularization** | | |
| Ridge | 0.039 | 0.067 |
| LASSO | 0.039 | 0.068 |
| Elastic net (alpha=0.5) | 0.038 | 0.067 |
One pattern is certainly clear: in-sample error is smaller than out-of-sample error.
In looking over these slides, please take note of when the training data were used compared to when the testing data were used. The testing data were only used at the very end to calculate out-of-sample error. The training data were used everywhere else including determining what explanatory variables to include and determining regularization penalties. For the regularization approaches, cross-validation was used within the training data to identify the penalty that minimized cross-validation error.
In all of the above approaches, we utilized computing heavily to repeatedly perform tasks, e.g. stepwise searches through candidate models, fitting all subsets for model averaging, and cross-validation to choose regularization penalties.
Much of this has been built into the functions we are using, but, more generally, you should consider building your own functions to have the computer repeatedly perform similar tasks. This will make you more efficient and therefore more valuable.