R code

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
theme_set(theme_bw())
library("MASS")
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library("MuMIn")
## Warning: package 'MuMIn' was built under R version 4.2.3
library("glmnet")
## Warning: package 'glmnet' was built under R version 4.2.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-7
library("knitr")
library("kableExtra")
## Warning: package 'kableExtra' was built under R version 4.2.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

These slides introduce the ideas of variable selection using information criteria and of regularized regression techniques. Both approaches add a penalty to the methodology and thus fall under the umbrella of penalized regression techniques.

For simplicity in the analyses, I will use subsets of the diamonds data set: we eliminate the categorical variables and randomly select 100 observations each for a training set and a testing set.

set.seed(20230425)
n <- 100
d <- diamonds %>%
  dplyr::select(-cut, -color, -clarity) %>% # drop the categorical variables
  rename(lprice = price) %>%
  mutate(lprice = log(lprice))              # model price on the log scale

train <- d %>% sample_n(n)
test  <- d %>% sample_n(n)

Let’s fit a couple of regression models to compare with our penalized approaches.

Intercept only

m <- lm(lprice ~ 1, data = train)
p <- predict(m, newdata = test)

error <- data.frame(
  group = "Regression",
  method = "Intercept-only",
  in_sample = mean(m$residuals^2),
  out_of_sample = mean((p - test$lprice)^2)
)

All main effects of the continuous variables.

m <- lm(lprice ~ ., data = train)
p <- predict(m, newdata = test)

error <- bind_rows(
  error,
  data.frame(
    group = "Regression",
    method = "Main effects",
    in_sample = mean(m$residuals^2),
    out_of_sample = mean((p - test$lprice)^2)
  )
)

Now include all two-way interactions.

m <- lm(lprice ~ .^2, data = train)
p <- predict(m, newdata = test)

error <- bind_rows(
  error,
  data.frame(
    group = "Regression",
    method = "Interactions",
    in_sample = mean(m$residuals^2),
    out_of_sample = mean((p - test$lprice)^2)
  )
)

Variable selection

In regression models with many explanatory variables, there is a question of which explanatory variables should be included in a model. If, ultimately, we choose a single set, this is a model or variable selection problem.

We already know we can formally test two models that are nested.

m1 <- lm(lprice ~ ., data = train)
m2 <- lm(lprice ~ .^2, data = train)
anova(m1, m2)
## Analysis of Variance Table
## 
## Model 1: lprice ~ carat + depth + table + x + y + z
## Model 2: lprice ~ (carat + depth + table + x + y + z)^2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     93 3.8094                           
## 2     78 3.0168 15   0.79257 1.3661 0.1855

We can utilize this approach to search through models sequentially to determine which explanatory variables (and interactions) should be included in the final model.

R has a step() function that will automatically perform this process, although rather than utilizing the F-test and associated p-value it will use Akaike’s Information Criterion (AIC).

The general AIC formula is \(-2\) times the log-likelihood plus two times the number of \(\beta\)s (call this \(p\)). For linear regression, up to an additive constant, the formula is \[ AIC = n\log(\hat\sigma^2) + 2p. \] The \(2p\) term is a penalty that attempts to reduce the number of parameters, and therefore explanatory variables, in the model. We can obtain AIC from a given model using the extractAIC() function:

extractAIC(m1)
## [1]    7.0000 -312.7697
extractAIC(m2)
## [1]   22.0000 -306.0963

The first value is \(p\) while the second value is AIC.

Since we are trying to minimize \(\hat\sigma^2\) and the number of parameters, we are looking for models whose AIC is small. Thus, in this comparison, AIC prefers model m1 due to its lower AIC.
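
As a quick check, we can reproduce the AIC reported for m1 directly from the formula above (for lm objects, extractAIC() drops additive constants and reports \(n\log(RSS/n) + 2p\)):

rss <- sum(residuals(m1)^2) # residual sum of squares
p   <- length(coef(m1))     # number of estimated regression coefficients (including the intercept)
n * log(rss / n) + 2 * p    # should match the second element of extractAIC(m1)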

An alternative to AIC is BIC which replaces the penalty by \(\log(n)\times p\). Since \(\log(n)>2\) for \(n > 7\), BIC generally suggests smaller models than AIC.

extractAIC(m1, k = log(n))
## [1]    7.0000 -294.5335
extractAIC(m2, k = log(n))
## [1]   22.0000 -248.7825

BIC prefers m1 since, again, it has the smaller BIC.

Forward

One approach is to start with the most basic model, one that includes only the intercept, and then repeatedly add the variable that most improves (decreases) AIC.

m_forward <- step(lm(lprice ~ 1, data = train),
  scope = formula(lm(lprice ~ .^2, data = train)),
  direction = "forward"
)
## Start:  AIC=2.56
## lprice ~ 1
## 
##         Df Sum of Sq     RSS      AIC
## + x      1    95.846   4.714 -301.472
## + y      1    95.818   4.741 -300.882
## + z      1    95.114   5.446 -287.031
## + carat  1    89.333  11.226 -214.690
## + table  1     3.967  96.593    0.533
## <none>               100.560    2.558
## + depth  1     0.001 100.558    4.557
## 
## Step:  AIC=-301.47
## lprice ~ x
## 
##         Df Sum of Sq    RSS     AIC
## + carat  1   0.67252 4.0411 -314.87
## <none>               4.7136 -301.47
## + y      1   0.02839 4.6852 -300.08
## + z      1   0.00474 4.7088 -299.57
## + table  1   0.00068 4.7129 -299.49
## + depth  1   0.00036 4.7132 -299.48
## 
## Step:  AIC=-314.87
## lprice ~ x + carat
## 
##           Df Sum of Sq    RSS     AIC
## + carat:x  1  0.163321 3.8778 -316.99
## + z        1  0.115022 3.9261 -315.75
## <none>                 4.0411 -314.87
## + y        1  0.065853 3.9752 -314.51
## + depth    1  0.054045 3.9870 -314.21
## + table    1  0.016588 4.0245 -313.28
## 
## Step:  AIC=-316.99
## lprice ~ x + carat + x:carat
## 
##         Df Sum of Sq    RSS     AIC
## <none>               3.8778 -316.99
## + y      1 0.0230570 3.8547 -315.59
## + table  1 0.0131537 3.8646 -315.33
## + z      1 0.0037831 3.8740 -315.09
## + depth  1 0.0023037 3.8754 -315.05

Final model

summary(m_forward)$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  2.5168764  0.6401572  3.931654 0.0001592841
## x            0.8404663  0.2118062  3.968092 0.0001397501
## carat        2.1539661  1.5317040  1.406255 0.1628771486
## x:carat     -0.2547876  0.1267100 -2.010792 0.0471520524

Training and testing error

p <- predict(m_forward, newdata = test)

error <- bind_rows(
  error,
  data.frame(
    group = "Selection",
    method = "Forward",
    in_sample = mean(m_forward$residuals^2),
    out_of_sample = mean((p - test$lprice)^2)
  )
)

Backward

Another approach is to start with the largest model and repeatedly eliminate the variable whose removal most improves (decreases) AIC.

m_backward <- step(lm(lprice ~ .^2, data = train),
  direction = "backward"
)
## Start:  AIC=-306.1
## lprice ~ (carat + depth + table + x + y + z)^2
## 
##               Df Sum of Sq    RSS     AIC
## - carat:x      1  0.000996 3.0178 -308.06
## - depth:y      1  0.002573 3.0194 -308.01
## - carat:depth  1  0.006399 3.0232 -307.88
## - x:y          1  0.008296 3.0251 -307.82
## - depth:z      1  0.009201 3.0260 -307.79
## - y:z          1  0.010314 3.0271 -307.75
## - carat:z      1  0.012369 3.0292 -307.69
## - depth:x      1  0.018539 3.0354 -307.48
## - carat:table  1  0.034585 3.0514 -306.96
## - carat:y      1  0.058998 3.0758 -306.16
## <none>                     3.0168 -306.10
## - x:z          1  0.069490 3.0863 -305.82
## - table:y      1  0.107191 3.1240 -304.61
## - table:x      1  0.179468 3.1963 -302.32
## - table:z      1  0.179820 3.1967 -302.31
## - depth:table  1  0.225851 3.2427 -300.88
## 
## Step:  AIC=-308.06
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table + 
##     carat:y + carat:z + depth:table + depth:x + depth:y + depth:z + 
##     table:x + table:y + table:z + x:y + x:z + y:z
## 
##               Df Sum of Sq    RSS     AIC
## - depth:y      1  0.001585 3.0194 -310.01
## - depth:z      1  0.013114 3.0309 -309.63
## - y:z          1  0.017423 3.0353 -309.49
## - x:y          1  0.018038 3.0359 -309.47
## - depth:x      1  0.022474 3.0403 -309.32
## - carat:depth  1  0.028940 3.0468 -309.11
## - carat:table  1  0.033982 3.0518 -308.94
## <none>                     3.0178 -308.06
## - x:z          1  0.070127 3.0880 -307.77
## - carat:z      1  0.070772 3.0886 -307.75
## - carat:y      1  0.072388 3.0902 -307.69
## - table:y      1  0.106413 3.1242 -306.60
## - table:z      1  0.183291 3.2011 -304.17
## - table:x      1  0.184756 3.2026 -304.12
## - depth:table  1  0.230872 3.2487 -302.69
## 
## Step:  AIC=-310.01
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table + 
##     carat:y + carat:z + depth:table + depth:x + depth:z + table:x + 
##     table:y + table:z + x:y + x:z + y:z
## 
##               Df Sum of Sq    RSS     AIC
## - depth:z      1  0.015162 3.0346 -311.51
## - y:z          1  0.017085 3.0365 -311.45
## - x:y          1  0.019696 3.0391 -311.36
## - depth:x      1  0.030052 3.0495 -311.02
## - carat:depth  1  0.031525 3.0509 -310.97
## - carat:table  1  0.039606 3.0590 -310.71
## <none>                     3.0194 -310.01
## - x:z          1  0.075054 3.0945 -309.56
## - carat:z      1  0.076518 3.0959 -309.51
## - carat:y      1  0.078231 3.0976 -309.45
## - table:y      1  0.107596 3.1270 -308.51
## - table:z      1  0.183557 3.2030 -306.11
## - table:x      1  0.193039 3.2125 -305.81
## - depth:table  1  0.230651 3.2501 -304.65
## 
## Step:  AIC=-311.51
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table + 
##     carat:y + carat:z + depth:table + depth:x + table:x + table:y + 
##     table:z + x:y + x:z + y:z
## 
##               Df Sum of Sq    RSS     AIC
## - y:z          1  0.002620 3.0372 -313.42
## - carat:depth  1  0.016612 3.0512 -312.96
## <none>                     3.0346 -311.51
## - x:y          1  0.061645 3.0962 -311.50
## - carat:z      1  0.063135 3.0977 -311.45
## - depth:x      1  0.064278 3.0989 -311.41
## - carat:y      1  0.064680 3.0993 -311.40
## - carat:table  1  0.069263 3.1038 -311.25
## - x:z          1  0.094890 3.1295 -310.43
## - table:y      1  0.099180 3.1338 -310.29
## - table:z      1  0.168595 3.2032 -308.10
## - table:x      1  0.178070 3.2126 -307.81
## - depth:table  1  0.217293 3.2519 -306.59
## 
## Step:  AIC=-313.42
## lprice ~ carat + depth + table + x + y + z + carat:depth + carat:table + 
##     carat:y + carat:z + depth:table + depth:x + table:x + table:y + 
##     table:z + x:y + x:z
## 
##               Df Sum of Sq    RSS     AIC
## - carat:depth  1  0.014521 3.0517 -314.95
## <none>                     3.0372 -313.42
## - carat:z      1  0.061895 3.0991 -313.41
## - carat:y      1  0.063297 3.1005 -313.36
## - depth:x      1  0.064492 3.1017 -313.32
## - carat:table  1  0.084911 3.1221 -312.67
## - x:y          1  0.110684 3.1479 -311.84
## - x:z          1  0.113229 3.1504 -311.76
## - table:y      1  0.122732 3.1599 -311.46
## - table:x      1  0.188291 3.2255 -309.41
## - table:z      1  0.191552 3.2287 -309.31
## - depth:table  1  0.254134 3.2913 -307.39
## 
## Step:  AIC=-314.95
## lprice ~ carat + depth + table + x + y + z + carat:table + carat:y + 
##     carat:z + depth:table + depth:x + table:x + table:y + table:z + 
##     x:y + x:z
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     3.0517 -314.95
## - carat:table  1   0.08217 3.1339 -314.29
## - carat:y      1   0.12853 3.1802 -312.82
## - carat:z      1   0.12892 3.1806 -312.81
## - depth:x      1   0.16434 3.2161 -311.70
## - table:y      1   0.18633 3.2380 -311.02
## - x:y          1   0.19593 3.2476 -310.72
## - x:z          1   0.20241 3.2541 -310.52
## - table:x      1   0.21190 3.2636 -310.23
## - table:z      1   0.25652 3.3082 -308.88
## - depth:table  1   0.34762 3.3993 -306.16

Final model

summary(m_backward)$coef
##                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 693.9954125 198.98233863  3.487724 0.0007817938
## carat        16.6985110  10.47722252  1.593792 0.1147839305
## depth       -10.8918398   3.27313133 -3.327651 0.0013070183
## table       -10.6526813   3.28964993 -3.238242 0.0017299373
## x           -69.2748537  25.48698778 -2.718048 0.0079928073
## y           -65.1820052  22.93834218 -2.841618 0.0056463589
## z           184.1495785  62.99661772  2.923166 0.0044642064
## carat:table  -0.2512743   0.16808215 -1.494949 0.1387202199
## carat:y      -5.9414008   3.17774681 -1.869690 0.0650529439
## carat:z       9.2349531   4.93182748  1.872522 0.0646556918
## depth:table   0.1670259   0.05432095  3.074796 0.0028512277
## depth:x       0.2678994   0.12671818  2.114136 0.0375046461
## table:x       0.9508779   0.39608467  2.400693 0.0185997380
## table:y       0.8123051   0.36083522  2.251180 0.0270170333
## table:z      -2.7059509   1.02445321 -2.641361 0.0098652842
## x:y           4.2572933   1.84422767  2.308442 0.0234620396
## x:z          -6.9452919   2.96011494 -2.346291 0.0213454207

Training and testing error

p <- predict(m_backward, newdata = test)

error <- bind_rows(
  error,
  data.frame(
    group = "Selection",
    method = "Backward",
    in_sample = mean(m_backward$residuals^2),
    out_of_sample = mean((p - test$lprice)^2)
  )
)

Forward and backward

Or we can search in both directions, starting from an intermediate model (here, the main-effects model).

m_both <- step(lm(lprice ~ ., data = train),
  scope = formula(lm(lprice ~ .^2, data = train)),
  direction = "both"
)
## Start:  AIC=-312.77
## lprice ~ carat + depth + table + x + y + z
## 
##               Df Sum of Sq    RSS     AIC
## + depth:table  1   0.15650 3.6529 -314.96
## - table        1   0.00219 3.8116 -314.71
## - x            1   0.00285 3.8123 -314.69
## - y            1   0.00388 3.8133 -314.67
## - depth        1   0.06652 3.8759 -313.04
## <none>                     3.8094 -312.77
## - z            1   0.09535 3.9048 -312.30
## + carat:x      1   0.04561 3.7638 -311.97
## + depth:z      1   0.03887 3.7705 -311.80
## + carat:z      1   0.03178 3.7776 -311.61
## + carat:y      1   0.02195 3.7875 -311.35
## + table:z      1   0.01397 3.7954 -311.14
## + depth:y      1   0.01376 3.7956 -311.13
## + depth:x      1   0.01263 3.7968 -311.10
## + table:y      1   0.00599 3.8034 -310.93
## + table:x      1   0.00587 3.8035 -310.92
## + x:z          1   0.00437 3.8050 -310.88
## + carat:depth  1   0.00411 3.8053 -310.88
## + carat:table  1   0.00309 3.8063 -310.85
## + x:y          1   0.00082 3.8086 -310.79
## + y:z          1   0.00031 3.8091 -310.78
## - carat        1   0.86560 4.6750 -294.29
## 
## Step:  AIC=-314.96
## lprice ~ carat + depth + table + x + y + z + depth:table
## 
##               Df Sum of Sq    RSS     AIC
## - x            1   0.00119 3.6541 -316.93
## - y            1   0.00325 3.6562 -316.88
## <none>                     3.6529 -314.96
## - z            1   0.08603 3.7389 -314.64
## + carat:x      1   0.03018 3.6227 -313.79
## + x:z          1   0.01586 3.6370 -313.40
## + carat:z      1   0.01486 3.6380 -313.37
## + carat:y      1   0.01221 3.6407 -313.30
## + depth:x      1   0.01205 3.6409 -313.30
## + depth:y      1   0.01047 3.6424 -313.25
## + x:y          1   0.00998 3.6429 -313.24
## + table:y      1   0.00552 3.6474 -313.12
## + table:x      1   0.00500 3.6479 -313.10
## + table:z      1   0.00403 3.6489 -313.08
## + y:z          1   0.00204 3.6509 -313.02
## + depth:z      1   0.00155 3.6514 -313.01
## + carat:depth  1   0.00147 3.6514 -313.01
## + carat:table  1   0.00021 3.6527 -312.97
## - depth:table  1   0.15650 3.8094 -312.77
## - carat        1   0.88162 4.5345 -295.34
## 
## Step:  AIC=-316.93
## lprice ~ carat + depth + table + y + z + depth:table
## 
##               Df Sum of Sq    RSS     AIC
## - y            1   0.00253 3.6566 -318.86
## <none>                     3.6541 -316.93
## + carat:y      1   0.00904 3.6451 -315.18
## + carat:z      1   0.00673 3.6474 -315.12
## + table:y      1   0.00408 3.6500 -315.04
## + depth:y      1   0.00320 3.6509 -315.02
## + table:z      1   0.00285 3.6512 -315.01
## + carat:depth  1   0.00197 3.6521 -314.99
## + y:z          1   0.00142 3.6527 -314.97
## + x            1   0.00119 3.6529 -314.96
## + depth:z      1   0.00050 3.6536 -314.95
## + carat:table  1   0.00003 3.6541 -314.93
## - depth:table  1   0.15816 3.8123 -314.69
## - z            1   0.18660 3.8407 -313.95
## - carat        1   0.89011 4.5442 -297.13
## 
## Step:  AIC=-318.86
## lprice ~ carat + depth + table + z + depth:table
## 
##               Df Sum of Sq     RSS     AIC
## <none>                      3.6566 -318.86
## + carat:depth  1    0.0043  3.6524 -316.98
## + carat:z      1    0.0033  3.6534 -316.95
## + y            1    0.0025  3.6541 -316.93
## + depth:z      1    0.0023  3.6543 -316.93
## + table:z      1    0.0014  3.6552 -316.90
## + x            1    0.0005  3.6562 -316.88
## + carat:table  1    0.0001  3.6566 -316.86
## - depth:table  1    0.1582  3.8149 -316.63
## - carat        1    0.8887  4.5453 -299.11
## - z            1    7.1337 10.7903 -212.65

Final model

summary(m_both)$coef
##                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 57.07953828 24.749795370  2.306263 2.329334e-02
## carat       -1.09504132  0.229098703 -4.779780 6.467710e-06
## depth       -0.91194731  0.404259186 -2.255848 2.639880e-02
## table       -0.84991422  0.420502725 -2.021186 4.610443e-02
## z            2.12046543  0.156584802 13.541962 8.202012e-24
## depth:table  0.01386812  0.006875878  2.016924 4.655592e-02

As can be seen from this example, there is no reason that these approaches will lead to the same model.

Training and testing error

p <- predict(m_both, newdata = test)

error <- bind_rows(
  error,
  data.frame(
    group = "Selection",
    method = "Forward and backward",
    in_sample = mean(m_both$residuals^2),
    out_of_sample = mean((p - test$lprice)^2)
  )
)

All subsets

There is no guarantee that using a stepwise approach will lead to the model with the best AIC (or any other criterion). Above we saw that 3 different approaches led to 3 different models.

An alternative to performing a stepwise approach to model selection is to calculate AIC for all models and then choose the model that has the lowest AIC. The number of models we need to consider is 2 raised to the number of explanatory variables (since every explanatory variable can be either in or out of the model). When the number of explanatory variables is small, this is possible. As the number of explanatory variables increases, this becomes computationally infeasible.

data.frame(p = 1:20) %>%
  mutate(`Number of models` = 2^p) %>%
  ggplot(aes(x = p, y = `Number of models`)) +
  geom_line()

When you include interactions, the number of possible two-way interactions is the number of explanatory variables choose 2.

data.frame(p = 1:20) %>%
  mutate(`Number of interactions` = choose(p,2)) %>%
  ggplot(aes(x = p, y = `Number of interactions`)) +
  geom_line()

When we are considering all subsets, we can consider including or excluding these interactions as well. If each main effect and each two-way interaction can be included or excluded, then we have \(2^{p + \binom{p}{2}}\) possible models.

data.frame(p = 1:20) %>%
  mutate(`Number of models` = 2^(p + choose(p, 2))) %>%
  ggplot(aes(x = p, y = `Number of models`)) +
  geom_line()

But of course we could consider higher order interactions as well.

The bottom line is that even for a small number of explanatory variables, there are a large number of possible models.

Model averaging

Rather than selecting a single model, we can utilize all models and give each model a weight (a probability based on its information criterion). Then, for each test observation, we average our predictions across the models.

m_avg <- lm(lprice ~ ., data = train, na.action = na.fail) %>%
  MuMIn::dredge()
## Fixed term is "(Intercept)"
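
The weights here are Akaike weights computed from differences in the information criterion, \(w_i \propto \exp(-\Delta_i/2)\). A minimal sketch of that calculation from the dredge() table (which ranks models by AICc by default):

delta <- m_avg$AICc - min(m_avg$AICc) # AICc difference from the best model
w     <- exp(-delta / 2)
w     <- w / sum(w)                   # normalize; should reproduce the weight column of m_avg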

Obtain predictions by averaging over the top models that account for 99% of the cumulative model weight.

mp <- m_avg %>%
  get.models(subset = cumsum(weight) <= .99) %>%
  model.avg()

p_train <- predict(mp, newdata = train)
p_test <- predict(mp, newdata = test)

error <- bind_rows(
  error,
  data.frame(
    group = "Model averaging",
    method = "AIC",
    in_sample = mean((p_train - train$lprice)^2),
    out_of_sample = mean((p_test - test$lprice)^2)
  )
)

Regularized regression

Recall that a multiple regression model can be written \[ Y = X\beta + \epsilon, \quad \epsilon \sim N(0,\sigma^2\mathrm{I}_n) \] where

  • \(Y\) is an \(n\times 1\) vector of response variable values,
  • \(X\) is an \(n\times p\) matrix of explanatory variable values,
  • \(\beta\) is a \(p\times 1\) vector of coefficients,
  • \(\epsilon\) is an \(n\times 1\) vector of errors, and
  • \(\mathrm{I}_n\) is the \(n\times n\) identity matrix.

The MLE for \(\beta\) is \[ \hat\beta_{MLE} = [X^\top X]^{-1} X^\top y. \]

This MLE is the solution to the formula \[ \hat\beta_{MLE} = \mbox{argmin}_{\beta}\, (Y-X\beta)^\top (Y-X\beta), \] i.e. the sum of squared residuals.
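
As a quick sanity check, the MLE can be computed directly from this formula (via the normal equations) and compared with lm(); a minimal sketch using the main-effects design matrix:

X <- model.matrix(lprice ~ ., data = train) # design matrix, including the intercept column
solve(t(X) %*% X, t(X) %*% train$lprice)    # should match coef(lm(lprice ~ ., data = train)),
                                            # though lm() uses a more numerically stable QR decomposition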

In this section, we will be using the glmnet package which takes as input the response vector and the explanatory variable matrix.

train <- list(
  y = train$lprice,
  x = subset(train, select = -c(lprice)) %>% as.matrix()
)

test <- list(
  y = test$lprice,
  x = subset(test, select = -c(lprice)) %>% as.matrix()
)

Ridge regression

For ridge regression, we add a penalty to the formula above. Specifically, the ridge regression estimator is the solution to \[ \hat\beta_{ridge} = \mbox{argmin}_{\beta}\, \frac{1}{2n}(Y-X\beta)^\top (Y-X\beta) + \lambda \sum_{j} \beta_j^2 \] where the penalty is \(\lambda \sum_{j} \beta_j^2\) for some value of \(\lambda>0\). Note that this corresponds to the glmnet definition and may differ from other sources, but the idea is the same.

Since we are trying to find the minimum and this penalty sums the square of the \(\beta\)s, the solution to this formula will have \(\beta\) closer to zero. Thus, this method is one of the shrinkage methods.

The solution is available in closed form as \[ \hat\beta_{ridge} = [X^\top X + \lambda\mathrm{I}_p]^{-1} X^\top y. \]
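
For a fixed \(\lambda\) this closed form is easy to evaluate directly. A minimal sketch, standardizing x and centering y so the intercept can be ignored (the value \(\lambda = 0.1\) is purely illustrative and is not on glmnet's scale):

X <- scale(train$x)          # standardize the explanatory variables
y <- train$y - mean(train$y) # center the response so the intercept drops out
lambda <- 0.1                # an arbitrary illustrative penalty value
solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)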

Ridge regression can also be fit with the MASS::lm.ridge() function, but here we will use glmnet() with alpha = 0, continuing with the continuous-variable training subset of the diamonds data constructed above.

Fit ridge regression

m <- glmnet(
  x = train$x,
  y = train$y,
  alpha = 0 # ridge regression
)

Shrinkage

The parameter estimates shrink toward zero as the ridge penalty increases, as can be seen in this plot.

plot(m, xvar = "lambda", label = TRUE)

Note that all coefficient estimates converge to zero (although not monotonically), but are never exactly equal to zero.

Choosing penalty

Estimation of a ridge regression model involves choosing a value for \(\lambda\).

cv_m <- cv.glmnet(
  x = train$x,
  y = train$y,
  alpha = 0, # ridge regression

  # the default lambda sequence uses values that are too large for these data
  lambda = seq(0, 0.01, length = 100)
)
plot(cv_m)

Choose \(\lambda\) that minimizes cross-validation error in the training set.

best_lambda <- cv_m$lambda.min
m <- glmnet(
  x = train$x,
  y = train$y,
  alpha = 0, # ridge regression
  lambda = best_lambda
)

In and out of sample error

p_train <- predict(m, newx = train$x)
p_test <- predict(m, newx = test$x)

error <- bind_rows(
  error,
  data.frame(
    group         = "Regularization",
    method        = "Ridge",
    in_sample     = mean((p_train - train$y)^2),
    out_of_sample = mean((p_test - test$y)^2)
  )
)

When I performed this analysis using MASS::lm.ridge(), my in-sample and out-of-sample errors were almost exactly the same as for the other methods, so I am not sure what is going on with the glmnet implementation of ridge regression.

Bayesian interpretation

There is a Bayesian interpretation of the ridge regression as a normal prior on the coefficients centered at 0. \[ Y \sim N(X\beta, \sigma^2 \mathrm{I}_n), \quad \beta \sim N\left(0, \lambda^{-1}\mathrm{I}_p\right) \]

LASSO

For least absolute shrinkage and selection operator (LASSO), we add a penalty to the lm formula above. Specifically, the LASSO estimate is the solution to \[ \hat\beta_{LASSO} = \mbox{argmin}_{\beta}\, \frac{1}{2n}(Y-X\beta)^\top (Y-X\beta) + \lambda \sum_{j} |\beta_j| \] where the penalty is \(\lambda \sum_{j} |\beta_j|\) for some value of \(\lambda>0\). Note that this corresponds to the glmnet definition and may differ from other sources, but the idea is the same.

Since we are trying to find the minimum and this penalty sums the absolute values of the \(\beta\)s, the solution to this formula will have \(\beta\) closer to zero. Thus, this method is also a shrinkage method.

The penalty here is very similar to the ridge penalty, but here we use the absolute value rather than the squared coefficient.

Shrinkage

The LASSO point estimate cannot be written analytically, but we have reasonable algorithms to obtain the estimate.
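
A key step in those algorithms (glmnet uses cyclic coordinate descent) is soft thresholding, which is what produces estimates that are exactly zero. A minimal sketch of the operator itself, not glmnet's implementation:

soft_threshold <- function(z, gamma) {
  # shrink z toward zero by gamma; values within gamma of zero become exactly zero
  sign(z) * pmax(abs(z) - gamma, 0)
}
soft_threshold(c(-2, -0.3, 0.1, 1.5), gamma = 0.5) # -1.5, 0, 0, 1

We now fit the LASSO path with glmnet().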

m <- glmnet(
  x = train$x,
  y = train$y,
  alpha = 1 # LASSO
)

Estimated coefficients as a function of the penalty \(\lambda\).

plot(m, xvar = "lambda", label = TRUE)

Note that some of the point estimates are exactly zero, unlike in the ridge analysis; thus, LASSO can perform model selection. That said, this selection is based only on the point estimate and does not incorporate uncertainty about that estimate.

Choosing penalty

We still need to decide on a penalty to use. We can choose based on cross-validation.

cv_m <- cv.glmnet(
  x = train$x,
  y = train$y,
  alpha = 1 # LASSO
)
plot(cv_m)

Choose \(\lambda\) that minimizes cross-validation error in the training set.

best_lambda <- cv_m$lambda.min
m <- glmnet(
  x = train$x,
  y = train$y,
  alpha = 1, # LASSO
  lambda = best_lambda
)
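
We can inspect which coefficients the LASSO point estimate sets exactly to zero at this \(\lambda\); entries printed as "." in the returned sparse matrix are zero.

coef(m)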

In and out of sample error

p_train <- predict(m, newx = train$x)
p_test <- predict(m, newx = test$x)

error <- bind_rows(
  error,
  data.frame(
    group         = "Regularization",
    method        = "LASSO",
    in_sample     = mean((p_train - train$y)^2),
    out_of_sample = mean((p_test - test$y)^2)
  )
)

Bayesian interpretation

There is a Bayesian interpretation of the LASSO as a Laplace (or double exponential) prior on the coefficients centered at 0.

Elastic net

The elastic net is a combination of the ridge and LASSO approaches. As parameterized by glmnet(), the penalty is (I think)

\[ \hat\beta_{EN} = \mbox{argmin}_{\beta}\, \frac{1}{2n}(Y-X\beta)^\top (Y-X\beta) + \lambda \left[ (1-\alpha) \sum_j \beta_j^2 + \alpha \sum_{j} |\beta_j| \right] \]

Thus, when \(\alpha = 1\) we have LASSO while if \(\alpha=0\) we have ridge regression. Values of \(\alpha\) between 0 and 1 are a mixture between the two.
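
Note that cv.glmnet() chooses \(\lambda\) for a fixed \(\alpha\); it does not choose \(\alpha\). If we also wanted to compare \(\alpha\) values, one common approach is to reuse the same fold assignments (via foldid) so the cross-validation errors are comparable across \(\alpha\); a sketch:

foldid   <- sample(rep(1:10, length.out = length(train$y))) # shared 10-fold assignment
alphas   <- c(0, 0.25, 0.5, 0.75, 1)
cv_error <- sapply(alphas, function(a) {
  min(cv.glmnet(x = train$x, y = train$y, alpha = a, foldid = foldid)$cvm)
})
alphas[which.min(cv_error)] # alpha with the smallest cross-validation error

Here we will simply fix \(\alpha = 0.5\).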

cv_m <- cv.glmnet(
  x = train$x,
  y = train$y,
  alpha = 0.5 # elastic net
)

Choose \(\lambda\) that minimizes cross-validation error in the training set.

best_lambda <- cv_m$lambda.min
m <- glmnet(
  x = train$x,
  y = train$y,
  alpha = 0.5, # elastic net
  lambda = best_lambda
)

In and out of sample error

p_train <- predict(m, newx = train$x)
p_test <- predict(m, newx = test$x)

error <- bind_rows(
  error,
  data.frame(
    group         = "Regularization",
    method        = "Elastic net (alpha=0.5)",
    in_sample     = mean((p_train - train$y)^2),
    out_of_sample = mean((p_test - test$y)^2)
  )
)

Summary

In these slides, we introduced a number of different approaches within the context of regression for deciding what explanatory variables to include as well as estimating the coefficients for included explanatory variables. All methods had in common that they penalized bigger, more flexible models in some way. The model selection approaches utilized an information criterion that penalized more explanatory variables. The regularization techniques penalized the magnitude of the coefficients which had the effect of shrinking those coefficients toward 0.

Comparison

We may want to compare the in-sample and out-of-sample errors across our different methods. We need to be a bit careful here because I allowed more explanatory variables in the model selection and averaging approaches than I did in the regularization approaches.

error %>%
  dplyr::select(-group) %>%
  kable(digits = 3) %>%
  pack_rows(index = table(fct_inorder(error$group))) %>%
  kable_styling()
method                      in_sample   out_of_sample
Regression
  Intercept-only                1.006           1.025
  Main effects                  0.038           0.067
  Interactions                  0.030           0.071
Selection
  Forward                       0.039           0.070
  Backward                      0.031           0.069
  Forward and backward          0.037           0.067
Model averaging
  AIC                           0.038           0.067
Regularization
  Ridge                         0.039           0.067
  LASSO                         0.039           0.068
  Elastic net (alpha=0.5)       0.038           0.067

One pattern is certainly clear: in-sample error is smaller than out-of-sample error.

Test vs. train

In looking over these slides, please take note of when the training data were used compared to when the testing data were used. The testing data were only used at the very end to calculate out-of-sample error. The training data were used everywhere else including determining what explanatory variables to include and determining regularization penalties. For the regularization approaches, cross-validation was used within the training data to identify the penalty that minimized cross-validation error.

Computing

In all of the above approaches, we utilized computing heavily to repeatedly perform tasks, e.g.

  • calculating AIC for a number of models
  • estimating parameters with different penalties

Much of this has been built into the functions we are using, but, more generally, you should consider building your own functions to have the computer repeatedly perform similar tasks. This will make you more efficient and therefore more valuable.
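
For example, the error bookkeeping above repeated nearly identical lines for each method. A small helper function, sketched here assuming a model with a predict() method and the original train/test data frames (not the matrix versions built for glmnet), removes that repetition:

mse_row <- function(model, group, method, train, test, response = "lprice") {
  data.frame(
    group         = group,
    method        = method,
    in_sample     = mean((predict(model, newdata = train) - train[[response]])^2),
    out_of_sample = mean((predict(model, newdata = test)  - test[[response]])^2)
  )
}
# e.g. error <- bind_rows(error, mse_row(m_forward, "Selection", "Forward", train, test))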