R code

library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("Sleuth3")
library("knitr")
library("kableExtra")
## Warning: package 'kableExtra' was built under R version 4.2.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

What is machine learning?

From IBM:

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

and later

machine learning algorithms are used to make a prediction or classification

For the purposes of this class, we will focus on the areas of ML that predict a continuous variable and classify a binary variable.

The fields of ML and statistics overlap quite a bit. IMO, the primary distinction between the two is that

  • ML prioritizes speed and predictive performance while
  • statistics prioritizes interpretation and theoretical foundation.

While the fields overlap, they use noticeably different terminology; many concepts have one name in statistics and a synonym in machine learning (e.g., response variable vs. label, explanatory variables vs. features, estimation vs. training).

For this discussion, we’ll stick with (my version of) the statistics terminology.

Speed

Both ML approaches and statistical approaches are based on algorithms, and the speed of an algorithm is typically measured using big O notation. Big O describes how long the code takes to run as a function of the quantities inside the O, here the sample size \(n\) and the number of explanatory variables \(p\). For example, to obtain the MLEs for multiple regression, we need to compute \[ \hat\beta = [X^\top X]^{-1}X^\top y. \] Although there are more efficient ways to perform these computations, the basic approach requires the following calculations with these complexities (sketched in code after the list):

  • matrix product \(X^\top X\) with complexity \(O(p^2n)\),
  • matrix-vector product \(X^\top y\) with complexity \(O(pn)\), and
  • inverse \((X^\top X)^{-1}\) with complexity \(O(p^3)\).
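
As a rough illustration, here is a minimal sketch in R of these computations; the data and sizes are made up, not part of the original example, and crossprod() forms the matrix products while solve() handles the \(p \times p\) system.

set.seed(1)
n <- 1000; p <- 10                                    # illustrative sizes
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # n x p design matrix
y <- rnorm(n)

XtX  <- crossprod(X)                    # X^T X, complexity O(p^2 n)
Xty  <- crossprod(X, y)                 # X^T y, complexity O(p n)
beta <- solve(XtX, Xty)                 # solve the p x p system, O(p^3)

max(abs(beta - coef(lm(y ~ X - 1))))    # agrees with lm() up to numerical error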

Predictive performance

A key notion in evaluating prediction is that you need to actually be making predictions. A common approach is to construct a train-test split of the data; another approach is cross-validation.

Train-test data

A common approach is to split your data into two groups: one called training and the other called testing. You then estimate a model based on the training data, but evaluate the model based on your testing data.

set.seed(20230424)

n <- 1000
x <- rnorm(n)
y <- rnorm(n, x)
d <- data.frame(y = y, x = x)

# Split data
p <- 0.5 # split 50/50 (on average)
b <- sample(c(FALSE, TRUE), size = n, replace = TRUE, prob = c(1 - p, p))

train <- d[ b,]
test  <- d[!b,]

# Estimate
m <- lm(y~x, data = train)
mean(residuals(m)^2) # in-sample error
## [1] 1.003538
# Predict
p <- test %>% mutate(yhat = predict(m, newdata = test))

# Evaluate
mean((p$y - p$yhat)^2) # out-of-sample error
## [1] 1.059443

Generally, your out-of-sample error will be higher than your in-sample error. If your out-of-sample error is much larger than your in-sample error, you have likely overfit your model.

Typically, you will be comparing methods based on this out-of-sample evaluation.

Cross-validation

An alternative to the train-test split is \(k\)-fold cross-validation. In this approach, you split your data into \(k\) folds. Then you leave out one fold at a time, estimate your model based on the remaining data, and use the left-out fold as the test data. Repeating this process across all the folds provides an estimate of the out-of-sample error as well as the variability in this error.

# k folds
k <- 10
fold  <- sample(k, size = n, replace = TRUE)
error <- data.frame(
  in_sample     = numeric(k),
  out_of_sample = numeric(k)
)

# iterate over the folds
for (i in 1:k) {
  train <- d[fold != i,]
  test  <- d[fold == i,]
  
  m <- lm(y ~ x, data = train)
  error$in_sample[i] <- mean(residuals(m)^2)
  
  p <- test %>% mutate(yhat = predict(m, newdata = test))
  error$out_of_sample[i] <- mean((p$y - p$yhat)^2) 
}

error
##    in_sample out_of_sample
## 1   1.010208     1.2166871
## 2   1.008662     1.1851497
## 3   1.015991     1.1427896
## 4   1.030771     1.0079976
## 5   1.012005     1.1636154
## 6   1.044909     0.8446286
## 7   1.023682     1.0734821
## 8   1.045346     0.8632544
## 9   1.055374     0.8070023
## 10  1.036258     0.9721119
mean(error$out_of_sample)
## [1] 1.027672
sd(  error$out_of_sample) 
## [1] 0.1516811

A special case of \(k\)-fold cross-validation is leave-one-out cross-validation where \(k\) is equal to the sample size.
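
As a minimal sketch, reusing the simulated data d from above, leave-one-out cross-validation refits the model n times, leaving one observation out each time:

loo <- numeric(n)
for (i in 1:n) {
  m_i    <- lm(y ~ x, data = d[-i, ])                    # fit without observation i
  loo[i] <- (d$y[i] - predict(m_i, newdata = d[i, ]))^2  # squared error for observation i
}
mean(loo)  # leave-one-out estimate of the out-of-sample MSE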

Interpretation

(Statistical) regression models are appealing due to their (relatively) easy interpretation. For example, consider Sleuth3::ex0722:

ggplot(Sleuth3::ex0722, aes(Height, Force, color = Species)) + 
  geom_point() + 
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

m <- lm(Force ~ Height*Species, data = Sleuth3::ex0722)
summary(m)
## 
## Call:
## lm(formula = Force ~ Height * Species, data = Sleuth3::ex0722)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1458 -2.0996 -0.5013  1.8407 13.0937 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -9.0043     7.1472  -1.260 0.216843    
## Height                               2.6799     0.6744   3.973 0.000377 ***
## SpeciesHemigrapsus nudus            12.1644     8.6410   1.408 0.168846    
## SpeciesLophopanopeus bellus         -8.2482     9.3941  -0.878 0.386476    
## Height:SpeciesHemigrapsus nudus     -2.5351     0.8990  -2.820 0.008180 ** 
## Height:SpeciesLophopanopeus bellus   1.1688     0.9888   1.182 0.245929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.445 on 32 degrees of freedom
## Multiple R-squared:  0.788,  Adjusted R-squared:  0.7549 
## F-statistic: 23.79 on 5 and 32 DF,  p-value: 6.419e-10

For species Cancer productus, each additional mm increase in claw propodus height is associated with a 2.7 N increase in claw closing strength.

This interpretability is due to the relatively simple nature of a regression model. This simplicity generally results in worse predictive power compared to a more flexible model.

Theoretical foundation

Statistical methods are often built on a probability model which provides

  • diagnostics,
  • point estimator guarantees, and
  • interval estimator guarantees.

Even when statistical methods are not built on a probability model, e.g., nonparametric methods, they still provide guarantees based on other theoretical results, e.g., the central limit theorem (CLT).

Diagnostics are important for understanding when your method is inappropriate for the data at hand and therefore when its results may be unreliable.
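
For example, a minimal sketch of the standard residual diagnostics R provides for the regression model m fit to the Sleuth3::ex0722 data above:

opar <- par(mfrow = c(2, 2))  # arrange the four default diagnostic plots
plot(m)                       # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(opar)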

Prediction

In prediction, we are interested in forecasting a continuous response variable. So we are in the realm of multiple regression.

Point prediction

For observation \(i\) in the testing data set, let

  • \(Y_i\) be the observed value for the response variable and
  • \(\widehat{Y}_i\) be the predicted value for the response variable.

From these values, there are a variety of statistics we can calculate and use to evaluate and compare methods.

MSE

The statistic mentioned earlier is mean squared error (MSE): \[ MSE = \frac{1}{n}\sum_{i=1}^n (Y_i-\widehat{Y}_i)^2. \]

An alternative to MSE is root MSE (RMSE): \[ RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (Y_i-\widehat{Y}_i)^2}. \]

MAD

Rather than squared error, we can use absolute error and calculate mean absolute deviation (MAD): \[ MAD = \frac{1}{n}\sum_{i=1}^n |Y_i-\widehat{Y}_i|. \]

MAPE

If our observed \(Y_i\) can be close to zero, we may want to weight errors so that the same magnitude error is more important when the observed value of \(Y_i\) is close to zero. In this case, we can calculate mean absolute percentage error (MAPE): \[ MAPE = \frac{1}{n}\sum_{i=1}^n \left|\frac{Y_i-\widehat{Y}_i}{Y_i}\right|. \] Note that the error is now calculated as a percentage of the observed value.
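
Each of these point-prediction metrics is a one-line calculation in R. A minimal sketch with made-up vectors y_obs and y_hat (the names and values are illustrative only):

y_obs <- c(2.3, 1.1, 4.0, 3.2)         # observed responses (illustrative)
y_hat <- c(2.0, 1.5, 3.5, 3.0)         # predicted responses (illustrative)

mean((y_obs - y_hat)^2)                # MSE
sqrt(mean((y_obs - y_hat)^2))          # RMSE
mean(abs(y_obs - y_hat))               # MAD
mean(abs((y_obs - y_hat) / y_obs))     # MAPE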

R^2

The coefficient of determination for a regression of the predicted values on the observed values (or vice versa, since the result is the same) is often used. I’m not a big fan of this measure, as the correlation can be high even when the actual predictions are far off. For example,

y    <- runif(n)
yhat <- y*10 + rnorm(n)
cor(y,yhat)^2
## [1] 0.8996455
plot(yhat, y, ylim = c(0,10), xlim = c(0,10))
abline(0, 1, col = "red")

Under/overfitting

When making predictions, there is a trade-off between training a model so that it matches the training data well versus matching the testing data well. If the model doesn’t match either the training or the testing data well, the model is underfit. If the model matches the training data well but does not match the testing data well, then the model is overfit.

To demonstrate this phenomenon, we construct a linear regression model where the first 5 explanatory variables are related to the response. Additional explanatory variables are available but are unrelated to the response.

We then consider a sequence of regression models that add one explanatory variable at a time. For each fitted regression model, we calculate the out-of-sample MSE and plot it against the number of explanatory variables in the model.

set.seed(20230423)
n <- 100
p <- 50 

train <- matrix(rnorm(n*p), nrow = n, ncol = p) %>%
  as.data.frame() %>%
  mutate(y = V1 + V2 + V3 + V4 + V5 + rnorm(n)) %>%
  select(y, everything())

test <- matrix(rnorm(n*p), nrow = n, ncol = p) %>%
  as.data.frame() %>%
  mutate(y = V1 + V2 + V3 + V4 + V5 + rnorm(n))  %>%
  select(y, everything())

mse <- numeric(p)
for (i in 1:p) {
  m <- lm(y ~ ., data = train[,1:(i+1)])
  yhat <- predict(m, newdata = test)
  mse[i] <- mean((test$y - yhat)^2)
}

plot(mse, type = "l")
abline(v = 5, col = "gray")

Probabilistic predictions

The prediction evaluations above are all evaluations of a point prediction. When using statistical approaches, we generally have the ability to quantify uncertainty in our predictions, typically through a probability density function for our observations. For example, predictions from a multiple regression model are given as a normal distribution. Rather than simply evaluating the mean of that normal distribution, we can evaluate the uncertainty associated with the full predictive density.

Density evaluation

A simple way to evaluate a predictive density is to evaluate it at the observed data point \(Y_i\). For example, if \(\widehat{f}_i(\cdot)\) is our predictive density for observation \(i\), then \(\widehat{f}_i(y_i)\) is that density evaluated at our observation. If the observations are independent, then it makes sense to take the product of these evaluations \[ p(y) = \prod_{i=1}^n \widehat{f}_i(y_i). \] (Note that if the observations are not independent, then we should evaluate the joint density of the test set observations.)

Often, we will use the logarithm of this product: \[ \log p(y) = \log\left( \prod_{i=1}^n \widehat{f}_i(y_i) \right) = \sum_{i=1}^n \log \widehat{f}_i(y_i). \]
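
A minimal, self-contained sketch of this calculation (all names and data here are illustrative): approximate the predictive density from a fitted regression by a normal distribution whose standard deviation combines the residual scale with the standard error of the fitted value, then sum the log densities over the test set.

set.seed(1)
dd  <- data.frame(x = rnorm(100)); dd$y <- rnorm(100, dd$x)   # illustrative data
idx <- sample(c(TRUE, FALSE), 100, replace = TRUE)            # TRUE = training observation

fit  <- lm(y ~ x, data = dd[idx, ])
pred <- predict(fit, newdata = dd[!idx, ], se.fit = TRUE)
s    <- sqrt(pred$residual.scale^2 + pred$se.fit^2)           # plug-in predictive sd

sum(dnorm(dd$y[!idx], mean = pred$fit, sd = s, log = TRUE))   # log p(y) for the test set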

Prediction interval evaluation

Sometimes we don’t have a full predictive density, but instead we have prediction intervals. There are two aspects of these intervals we may want to evaluate:

  • coverage
  • width

Ideally, we have the shortest width intervals that have at least the nominal coverage, e.g. 95%.
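
Continuing the illustrative fit from the sketch above, the coverage and width of 95% prediction intervals can be evaluated as follows:

pi95 <- predict(fit, newdata = dd[!idx, ], interval = "prediction", level = 0.95)

mean(dd$y[!idx] >= pi95[, "lwr"] & dd$y[!idx] <= pi95[, "upr"])  # empirical coverage
mean(pi95[, "upr"] - pi95[, "lwr"])                              # average interval width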

Classification

Classification is prediction for a categorical variable. There are two main types of classification: binary and multiclass. As the names suggest, binary classification is predicting observations into one of two classes while multiclass classification is predicting observations into more than two classes.

Binary

For this discussion, assume the two binary classes are 0 and 1. As before, let the true value for observation \(i\) be \(Y_i\). We build a classifier that will produce a prediction \(\widehat{Y}_i\) that is 0 or 1 and may produce a probability \(\widehat{p}_i\) that the observation is a 1.

Confusion matrix

With our test data and our predictions, we can compute a confusion matrix which describes the accuracy of our predictions.

d <- data.frame(
  observed  = c(0, 0, 1, 1),
  predicted = c(0, 1, 0, 1),
  number    = c(84, 17, 23, 97)
)

d %>% 
  pivot_wider(names_from = predicted, values_from = number) %>%
  knitr::kable(caption = "Confusion matrix") %>%
  add_header_above(c(" " = 1, "predicted" = 2))
Confusion matrix

                 predicted
observed        0      1
       0       84     17
       1       23     97

From this confusion matrix, we can calculate a number of quantities of interest:

TP <- d %>% filter(observed == 1, predicted == 1) %>% pull(number) # True positives
TN <- d %>% filter(observed == 0, predicted == 0) %>% pull(number) # True negatives
FP <- d %>% filter(observed == 0, predicted == 1) %>% pull(number) # False positives
FN <- d %>% filter(observed == 1, predicted == 0) %>% pull(number) # False negatives

allP <- TP + FN
allN <- TN + FP

Quantities that are often reported are

(TP + TN) / (allP + allN) # Classification accuracy
## [1] 0.8190045
TP / allP # Sensitivity (true positive rate)
## [1] 0.8083333
TN / allN # Specificity (true negative rate)
## [1] 0.8316832

Log loss

The confusion matrix approach has the downside that it forces an algorithm to choose between the two classes. The algorithm may be very confident about some observations but not so confident about others. To incorporate this uncertainty, we can utilize the probabilities reported by the algorithm for the different classes.

When we have probabilities, we can calculate the log loss on the test data set \[ \mbox{log loss} = -\sum_{i=1}^n \left[ y_i \log(\widehat{p}_i) + (1-y_i) \log(1-\widehat{p}_i) \right]. \] Since each observation is either \(y_i = 0\) or \(y_i = 1\), only one of the two terms gets added to the sum for a particular observation. When \(y_i=1\), we add \(\log(\widehat{p}_i)\), which we hope is close to zero, and when \(y_i=0\), we add \(\log(1-\widehat{p}_i)\), which, again, we hope is close to zero. Since the logarithm of a number between 0 and 1 is always negative, we put a negative sign out front so that smaller numbers are better.
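
A minimal sketch of this calculation with made-up labels and predicted probabilities (not the output of any particular classifier):

y_obs <- c(1, 0, 1, 1, 0)              # observed classes (illustrative)
p_hat <- c(0.9, 0.2, 0.6, 0.8, 0.4)    # predicted probabilities of class 1 (illustrative)

-sum(y_obs * log(p_hat) + (1 - y_obs) * log(1 - p_hat))  # log loss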

Multiclass

When we have multiclass classification, we have an expanded confusion matrix and therefore more quantities we can calculate. For the log loss, we simply add the logarithms of the class probabilities for the observed classes and multiply by -1 at the end.
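
For example, a minimal sketch with an illustrative matrix P of predicted class probabilities (rows are observations, columns are classes) and a vector of observed class indices:

P <- rbind(c(0.7, 0.2, 0.1),
           c(0.1, 0.8, 0.1),
           c(0.3, 0.3, 0.4))                     # predicted class probabilities (illustrative)
obs_class <- c(1, 2, 3)                          # observed class for each observation

-sum(log(P[cbind(seq_len(nrow(P)), obs_class)])) # multiclass log loss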

Issues

Best method

I see so many papers and (graduate) students who evaluate a bunch of machine learning approaches on (I hope) a test data set and then claim “this method is the best!” Oddly enough, each student suggests a different method is the best, and oftentimes the same student suggests different methods are better for different purposes.

Thus, I conclude that machine learning methods are essentially equivalent in their predictive power and therefore exchangeable. With that being said, a colleague has suggested to me that extreme gradient boosting is the best method out there.

What seems much more important to me is “feature extraction”, i.e., which explanatory variables and which transformations of those variables you allow in your model.

Ethical considerations

Bias

Prediction methods are only as good as the data you put into them. If the data you put into them are biased, you will get out biased predictions. For example, facial recognition systems trained on unrepresentative data have shown substantially higher error rates for darker-skinned faces, and an experimental résumé-screening tool trained on historical hiring data learned to penalize résumés associated with women.

Job loss

Machine learning, and perhaps even more so generative AI, will lead to job losses. This is not dissimilar to robots on manufacturing lines, although these job losses will likely be felt by white-collar rather than blue-collar workers.

Privacy and security

As AI systems scour public information, they make accessible information that would formerly have been very difficult to organize.

For example, I asked ChatGPT

where does jarad niemi live?

It responded:

I’m sorry, but as an AI language model, I do not have access to personal information like the current address of individuals, including Jarad Niemi. Privacy is a very important aspect, and it’s not ethical to share personal information without their consent.

So it is nice that OpenAI has taken one step to protect my privacy, but will other AI systems? Can ChatGPT be tricked into providing my personal information?

Technological singularity

Will AI take over and make the human race obsolete? This seems far-fetched but many in the field are warning about it.

What is missing, for me, is a believable example that demonstrates how we would go from where we are now to this singularity.