library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Recall that estimators are statistics that attempt to estimate a population parameter. We are generally assuming here that we have a simple random sample from the population and we are interested in estimating a mean or proportion from that population.
Let \(X \sim Bin(n,\theta)\). The MLE for \(\theta\) is \[ \hat\theta = \frac{x}{n}. \]
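For reference, a quick sketch of where this comes from: up to an additive constant, the log-likelihood is \[ \ell(\theta) = x \log\theta + (n-x)\log(1-\theta), \] and solving \(\ell'(\theta) = \frac{x}{\theta} - \frac{n-x}{1-\theta} = 0\) gives \(\hat\theta = x/n\).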
Let \(Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2)\).
The MLE for \(\mu\) is \[ \hat\mu = \overline{y}. \] This also happens to be the least squares estimator and the Bayes estimator (assuming the standard default prior).
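As a quick numeric check (simulated data, so the exact numbers will vary from run to run), the sample mean matches the intercept-only least squares fit:
y <- rnorm(20, mean = 5, sd = 2)   # simulated data with true mean 5
mean(y)                            # MLE / least squares estimate of the mean
coef(lm(y ~ 1))                    # intercept-only regression returns the same value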
The MLE for \(\sigma^2\) is \[ \hat\sigma^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (y_i-\overline{y})^2 \] while the sample variance is \[ s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i-\overline{y})^2. \] So why might we prefer one of these estimators over another?
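One practical note: R's var() uses the \(n-1\) divisor, so the MLE can be obtained by rescaling. A minimal sketch with made-up data:
y <- c(2.1, 1.4, 3.3, 2.8, 0.9)        # hypothetical data
var(y)                                 # sample variance s^2, divisor n-1
(length(y) - 1) / length(y) * var(y)   # MLE, divisor n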
The regression model (in multivariate normal form) is \[ Y \sim N(X\beta,\sigma^2\mathrm{I}). \] The MLE, least squares estimator, and Bayes estimator for \(\beta\) all coincide: \[ \hat\beta = (X^\top X)^{-1}X^\top y. \]
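As a small illustration (using the built-in mtcars data), the matrix formula agrees with lm():
X <- cbind(1, mtcars$wt)              # design matrix: intercept and weight
y <- mtcars$mpg
solve(t(X) %*% X, t(X) %*% y)         # (X'X)^{-1} X'y
coef(lm(mpg ~ wt, data = mtcars))     # same estimates from lm()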
There are a variety of criteria we might use to choose an estimator. We might appreciate the universality of the maximum likelihood estimator or the ability to incorporate prior information in a Bayes estimator. Here we will talk about some properties of estimators, namely bias, variance, mean squared error, and consistency.
The bias of an estimator is the expectation of the estimator minus the true value of the parameter: \[ \mbox{Bias}[\hat\theta] = E[\hat\theta - \theta] = E[\hat\theta] - \theta. \]
An estimator is unbiased when the bias is 0 or, equivalently, its expectation is equal to the true value.
For example, \[ E[\hat\theta] = E\left[\frac{X}{n}\right] = \theta, \quad E[\hat\mu] = E[\overline{Y}] = \mu, \quad E[s^2] = \sigma^2, \quad \mbox{and} \quad E[\hat\beta] = \beta \]
indicate that all of these estimators are unbiased.
In contrast, \[ E[\hat\sigma^2_{MLE}] = \frac{n-1}{n}\sigma^2 \]
and thus the bias is \[ \mbox{Bias}[\hat\sigma^2_{MLE}] = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}. \]
Clearly this bias diminishes as \(n\to \infty\).
The variance of an estimator is just the variance of the statistic: \[ Var[\hat\theta] = E[(\hat\theta - E[\hat\theta])^2]. \]
Here are a couple of examples: \[ Var[\hat\theta] = \frac{\theta(1-\theta)}{n} \quad \mbox{and} \quad Var[\hat\mu] = Var[\overline{Y}] = \frac{\sigma^2}{n}. \]
Notice how both of these decrease as \(n\to \infty\).
The mean squared error of an estimator is the expectation of the squared difference between the estimator and the truth. This turns out to have a simple formula in terms of the bias and variance.
\[ MSE[\hat\theta] = E[(\hat\theta - \theta)^2] = Var[\hat\theta] + Bias[\hat\theta]^2 \]
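For completeness, a short derivation of this decomposition (add and subtract \(E[\hat\theta]\)): \[ E[(\hat\theta-\theta)^2] = E[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2] = E[(\hat\theta - E[\hat\theta])^2] + (E[\hat\theta] - \theta)^2, \] where the cross term \(2(E[\hat\theta]-\theta)\,E[\hat\theta - E[\hat\theta]]\) vanishes because \(E[\hat\theta - E[\hat\theta]] = 0\).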
In our examples, we have \[ MSE[\hat\theta] = \frac{\theta(1-\theta)}{n}, \quad MSE[\hat\mu] = \frac{\sigma^2}{n}, \quad \mbox{and} \quad MSE[\hat\sigma^2_{MLE}] = \frac{2(n-1)}{n^2}\sigma^4 + \frac{\sigma^4}{n^2}. \]
For unbiased estimators, the MSE is the same as the variance. For biased estimators, the formula is a bit more exciting.
An estimator is consistent if it gets closer to the true value in probability as the sample size gets larger. Specifically \[ \lim_{n\to\infty} P(|\hat\theta-\theta| > \epsilon) = 0 \] for all \(\epsilon>0\).
An unbiased estimator is consistent if its variance converges to zero as the sample size increases.
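One way to see this is to apply Markov's inequality to \((\hat\theta-\theta)^2\): \[ P(|\hat\theta-\theta| > \epsilon) \le \frac{E[(\hat\theta-\theta)^2]}{\epsilon^2} = \frac{Var[\hat\theta] + Bias[\hat\theta]^2}{\epsilon^2}, \] so any estimator whose variance and bias both converge to zero is consistent.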
Every estimator discussed above is consistent.
Let’s use Monte Carlo to evaluate these estimators. We will need to obtain many samples to estimate bias and variance. To investigate consistency, we will need many samples at a sequence of sample sizes.
Throughout this section we will utilize the CLT for an expectation to construct a CI. Recall if \(X_i\) are independent and identically distributed with \(E[X_i] = \mu\) and \(Var[X_i] = \sigma^2 < \infty\), then an approximate 100(1-a)% CI for \(\mu\) is \[ \overline{x} \pm z_{a/2}\, s / \sqrt{n} \] where \(\overline{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(z_{a/2}\) is the standard normal critical value satisfying \(P(Z > z_{a/2}) = a/2\).
We will encapsulate this in a function:
# Approximate 100(1-a)% CI for E[X] based on the CLT
create_ci <- function(x, a = 0.05) {
  n <- length(x)   # number of observations
  m <- mean(x)     # sample mean
  s <- sd(x)       # sample standard deviation
  return(m + c(-1,1)*qnorm(1-a/2)*s/sqrt(n))
}
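A quick check of the function with simulated data (the exact interval will vary from run to run, but it should typically cover the true mean of 2):
x <- rnorm(100, mean = 2)     # simulated data with true mean 2
create_ci(x)                  # approximate 95% CI for the mean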
Recall that the MLE for the probability of success in a binomial is \[ \hat\theta_{MLE} = x/n. \]
n <- 10
p <- 0.3
x <- rbinom(1e3, size = n, prob = p)
theta_hat <- x / n
ggplot() +
geom_histogram(mapping = aes(x = theta_hat), bins = 30) +
geom_vline(xintercept = p, col = 'red') +
labs(title="Sampling distribution for sample proportion")
mean(theta_hat) # expectation
## [1] 0.2988
mean(theta_hat - p) # bias
## [1] -0.0012
create_ci(theta_hat - p) # bias CI
## [1] -0.01006524 0.00766524
# variance
mean( (theta_hat - mean(theta_hat))^2 )
## [1] 0.02043856
create_ci( (theta_hat - mean(theta_hat))^2 )
## [1] 0.01871232 0.02216480
Recall that \(MSE[\hat\theta] = E[(\hat\theta-\theta)^2] = Var[\hat\theta] + Bias[\hat\theta]^2\).
# MSE
mean( (theta_hat - p)^2 )
## [1] 0.02044
# MSE CI
create_ci( (theta_hat-p)^2 )
## [1] 0.01871726 0.02216274
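As a rough check, the MLE is unbiased and \[ Var[\hat\theta] = MSE[\hat\theta] = \frac{\theta(1-\theta)}{n} = \frac{0.3 \times 0.7}{10} = 0.021, \] which agrees reasonably well with the Monte Carlo estimates above.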
epsilon <- 0.01
n_seq <- floor(10^seq(1, 5, length = 10))
p_seq <- numeric(length(n_seq))
for (i in seq_along(n_seq)) {
x <- rbinom(1e3, size = n_seq[i], prob = p)
p_seq[i] <- mean(abs(x/n_seq[i] - p) > epsilon)
}
ggplot(data.frame(n = n_seq, p = p_seq),
aes(x = n, y = p)) +
geom_point() +
geom_line() +
labs(x = "Sample size", y = "Probability",
title = "Consistency of sample proportion")
A default Bayesian analysis for an unknown binomial probability uses the Jeffreys prior \(\theta \sim Beta(0.5, 0.5)\), which yields the posterior \(\theta \mid x \sim Beta(0.5 + x,\, 0.5 + n - x)\) and the point estimate \[ \hat\theta_{Bayes} = \frac{0.5 + x}{1 + n}. \]
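Since the mean of a \(Beta(a,b)\) distribution is \(a/(a+b)\), this estimator is exactly the posterior mean: \[ E[\theta \mid x] = \frac{0.5 + x}{(0.5 + x) + (0.5 + n - x)} = \frac{0.5 + x}{1 + n}. \]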
n <- 10
p <- 0.3
x <- rbinom(1e3, size = n, prob = p)
theta_hat <- (0.5 + x) / (1 + n)
ggplot() +
geom_histogram(mapping = aes(x = theta_hat), bins = 30) +
geom_vline(xintercept = p, col = 'red') +
labs(title="Sampling distribution for sample proportion")
mean(theta_hat) # expectation
## [1] 0.3123636
mean(theta_hat - p) # bias
## [1] 0.01236364
create_ci(theta_hat - p) # bias CI
## [1] 0.003953058 0.020774215
# variance
mean( (theta_hat - mean(theta_hat))^2 )
## [1] 0.0183959
create_ci( (theta_hat - mean(theta_hat))^2 )
## [1] 0.01693049 0.01986131
Recall that \(MSE[\hat\theta] = E[(\hat\theta-\theta)^2] = Var[\hat\theta] + Bias[\hat\theta]^2\).
# MSE
mean( (theta_hat - p)^2 )
## [1] 0.01854876
# MSE CI
create_ci( (theta_hat-p)^2 )
## [1] 0.01703075 0.02006677
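As a rough check, \(\hat\theta_{Bayes} = (0.5 + X)/(1+n)\) has \[ \mbox{Bias}[\hat\theta_{Bayes}] = \frac{0.5 - \theta}{1+n} \approx 0.018 \quad \mbox{and} \quad Var[\hat\theta_{Bayes}] = \frac{n\theta(1-\theta)}{(1+n)^2} \approx 0.017, \] broadly consistent with the Monte Carlo estimates above.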
epsilon <- 0.01
n_seq <- floor(10^seq(1, 5, length = 10))
p_seq <- numeric(length(n_seq))
for (i in seq_along(n_seq)) {
x <- rbinom(1e3, size = n_seq[i], prob = p)
p_seq[i] <- mean(abs((0.5 + x)/(1 + n_seq[i]) - p) > epsilon)
}
ggplot(data.frame(n = n_seq, p = p_seq),
aes(x = n, y = p)) +
geom_point() +
geom_line() +
labs(x = "Sample size", y = "Probability",
title = "Consistency of sample proportion")
n <- 10
m <- 0
s <- 1
d <- expand.grid(rep = 1:1e3,
n = 1:n) %>%
mutate(y = rnorm(n(), mean = m, sd = s)) %>%
group_by(rep) %>%
summarize(
n = n(),
ybar = mean(y),
sd = sd(y)
)
ggplot(d, aes(x = ybar)) +
geom_histogram(bins = 30) +
geom_vline(xintercept = m, color = "red") +
labs(x = "Sample mean",
title = "Sampling distribution of sample mean")
# Bias
mean(d$ybar - m)
## [1] 0.01688741
# Bias with CI
create_ci( d$ybar - m )
## [1] -0.002459445 0.036234259
# Variance
mean( (d$ybar - mean(d$ybar))^2 )
## [1] 0.09733968
# Variance with uncertainty
create_ci( (d$ybar - mean(d$ybar))^2 )
## [1] 0.08853967 0.10613970
# MSE
mean( (d$ybar - m)^2 )
## [1] 0.09762487
# MSE with uncertainty
create_ci( (d$ybar - m)^2 )
## [1] 0.08879887 0.10645086
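As a rough check, the sample mean is unbiased and \[ Var[\overline{Y}] = MSE[\overline{Y}] = \frac{\sigma^2}{n} = \frac{1}{10} = 0.1, \] close to the Monte Carlo estimates above.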
epsilon <- 0.01
n_seq <- floor(10^seq(1, 5, length = 10))
p_seq <- numeric(length(n_seq))
for (i in seq_along(n_seq)) {
xbar <- replicate(1e3, { # replicate is another looping option
mean( rnorm(n_seq[i], m, s))
})
p_seq[i] <- mean( abs(xbar - m) > epsilon)
}
ggplot(data.frame(n = n_seq, p = p_seq),
aes(x = n, y = p)) +
geom_line() +
geom_point() +
labs(x = "Sample size", y = "Probability",
title = "Consistency of sample mean")