R code

Statistics

Statistics can generally be broken up into exploratory statistics and inferential statistics. Exploratory statistics is generally what most people think of when you say statistics, e.g. means, medians, figures, etc.

Exploratory statistics

When we collect data, we should always perform some exploratory analysis on it. The type of exploratory analysis we perform will depend on the type of data obtained. Generally, exploratory statistics can be categorized into descriptive and graphical statistics.

Descriptive statistics

Descriptive statistics are utilized when you want to numerically quantify some aspect of your data. If your data are categorical, then you are probably counting the number of observations in each category. If your data are continuous, then you are typically calculating some sample statistics.

  • measures of location
    • min/max
    • mean
    • median
  • measures of spread
    • range
    • interquartile range
    • standard deviation

These sample statistics provide a concise description of the data, but frankly they are quite lacking when it comes to any real understanding.
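
As a quick illustration, here is a minimal R sketch computing each of these summaries; the data (a continuous variable `y` and a categorical variable `group`) are hypothetical, made up purely for demonstration.

```r
set.seed(20180101)                                     # hypothetical data for illustration
y     <- rnorm(30, mean = 10, sd = 2)                  # a continuous variable
group <- sample(c("a", "b", "c"), 30, replace = TRUE)  # a categorical variable

table(group)       # counts per category

min(y); max(y)     # measures of location: min/max
mean(y)            # mean
median(y)          # median

diff(range(y))     # measures of spread: range
IQR(y)             # interquartile range
sd(y)              # standard deviation
```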

Graphical statistics

Much more powerful than descriptive statistics are graphical statistics, which provide summaries of data in the form of graphics or figures. Some common graphical statistics are

  • pie charts [I’m screaming inside]
  • bar charts [I’m still screaming inside]
  • histograms [I’m calming down…but just a little]
  • boxplots [here we go again]
  • scatterplots [whew…I’m relaxed]

Constructing useful graphical statistics is a widely under-utilized skill!! If you can convey relationships in data using graphics, then you don’t need inferential statistics (regardless of what your statistics professors tell you).
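
For the graphics I can tolerate, here is a minimal base-R sketch using the same kind of hypothetical data as above:

```r
set.seed(20180101)                                     # hypothetical data for illustration
y     <- rnorm(30, mean = 10, sd = 2)
x     <- rnorm(30)
group <- sample(c("a", "b", "c"), 30, replace = TRUE)

hist(y)            # histogram of a continuous variable
boxplot(y ~ group) # boxplots of y within each category
plot(x, y)         # scatterplot of y against x
```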

Inferential statistics

Inferential statistics encompasses most of what you have probably learned in your prerequisite statistics courses including p-values, hypothesis tests, and confidence intervals.

But lost in all of those formulas are the two most important concepts in making inferences:

  • to infer about a population, you need a random sample from that population
  • to infer cause-and-effect, you must randomly assign the treatment

When the sample is not a random sample, all your inferential statistics are just convenient summaries of the data; they say nothing formal about the population you intend to study.

When the treatment is not randomly assigned, you can only establish an association, not causation.

Thus there is a limit to what statistics can do with a particular data set, based on how those data were obtained.

Now…on with the formulas.

Binomial

If we have binomial data, then we are typically interested in understanding the probability of success. (There are really interesting problems where you don’t know the number of attempts, e.g. estimating population sizes.)

Maximum likelihood estimator

If \(Y\sim Bin(n,\theta)\), then the likelihood for \(\theta\) is \[L(\theta) = {n\choose y}\theta^y(1-\theta)^{n-y}, \qquad 0<\theta<1\]. The maximum likelihood estimator for \(\theta\) is \[\hat\theta_{MLE} = y/n\] as this is the value that maximizes the likelihood \(L(\theta)\).
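
A small R sketch, using hypothetical counts (\(y = 13\) successes in \(n = 20\) attempts), showing that the likelihood indeed peaks at \(y/n\):

```r
y <- 13; n <- 20                    # hypothetical data: y successes in n attempts
theta_hat <- y / n                  # maximum likelihood estimate

theta <- seq(0.01, 0.99, by = 0.01)
plot(theta, dbinom(y, n, theta),    # likelihood L(theta) as a function of theta
     type = "l", ylab = "L(theta)")
abline(v = theta_hat, lty = 2)      # the curve peaks at y/n
```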

Confidence interval

A 100(1-\(\alpha\))% (approximate) confidence interval for \(\theta\) based on \(y\) successes out of \(n\) attempts is \[\hat\theta \pm z_{\alpha/2}\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}}\] where \(\hat\theta = y/n\) and \(z_{\alpha/2}\) is the z-critical value such that \[\alpha/2 = \int_{z_{\alpha/2}}^\infty \phi(x)\, dx\] where \(\phi(x)\) is the probability density function of a standard normal. This formula is based on an asymptotic argument (sample size going to infinity) using the CLT.

This is one formula, but there are a whole bunch of others. So, even in the simplest possible models, things are not so clear. (And that is before we even mention the Bayesian approach.)
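
A minimal sketch of this Wald interval in R, again with hypothetical counts; `binom.test()` gives one of the alternative intervals for comparison:

```r
y <- 13; n <- 20; alpha <- 0.05   # hypothetical data and significance level
theta_hat <- y / n
z <- qnorm(1 - alpha / 2)         # z-critical value

theta_hat + c(-1, 1) * z * sqrt(theta_hat * (1 - theta_hat) / n)  # Wald interval
binom.test(y, n)$conf.int         # exact (Clopper-Pearson) interval, for comparison
```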

Hypothesis test

To conduct a hypothesis test with null hypothesis that the true probability of success is \(\theta_0\), we calculate \[z = \frac{\hat\theta-\theta_0}{\sqrt{\frac{\theta_0(1-\theta_0)}{n}}}.\] When \(n\) is sufficiently large, \(z\) has an approximate standard normal distribution and we can calculate a two-sided \(p\)-value \[p\mbox{-value} = 2P(Z < -|z|).\] If the \(p\)-value is less than our significance level \(\alpha\), then we reject the null model; otherwise, we fail to reject the null model. The null model is \[Y \sim Bin(n,\theta_0).\] What does rejecting it mean? It means there is evidence against this model, but that could mean

  • the attempts are not independent
  • the attempts don’t have a common probability
  • the common probability is not \(\theta_0\)
  • the number of attempts is not \(n\)

The result of the test does not tell us which of these is the problem. In addition, even if everything holds except (possibly) the common probability being \(\theta_0\), our interpretation of the \(p\)-value requires more context about the scientific setup.
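
A sketch of the calculation in R, using hypothetical counts and a hypothetical null value \(\theta_0 = 0.5\); `prop.test()` is a built-in alternative:

```r
y <- 13; n <- 20; theta0 <- 0.5   # hypothetical data and null value

theta_hat <- y / n
z <- (theta_hat - theta0) / sqrt(theta0 * (1 - theta0) / n)  # test statistic
2 * pnorm(-abs(z))                # two-sided p-value

prop.test(y, n, p = theta0)       # built-in alternative (chi-squared, with continuity correction)
```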

I highly recommend everyone read the ASA's Statement on \(p\)-Values. Then reread it periodically until it makes some sense.

Normal

If we collect data that are normally distributed, we typically don’t know the mean or the variance. Thus we have \(Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2)\). Our goal is often to make statements about \(\mu\), but we may also be interested in \(\sigma\).

Maximum likelihood estimator

The likelihood for \(\mu\) and \(\sigma^2\) is \[L(\mu,\sigma^2) = \prod_{i=1}^n (2\pi \sigma^2)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(y_i-\mu)^2\right)= (2\pi \sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2\right).\] The maximum likelihood estimators for \(\mu\) and \(\sigma^2\) are \[\hat\mu_{MLE} = \overline{y} \qquad \mbox{and} \qquad \hat\sigma^2_{MLE}=\frac{1}{n}\sum_{i=1}^n (y_i-\overline{y})^2 = \frac{n-1}{n}s^2\] where \(\overline{y}\) is the sample mean and \(s^2\) is the sample variance.
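
In R, with hypothetical normal data, these estimators look like the following (note that `var()` computes \(s^2\), not the MLE):

```r
set.seed(20180101)
y <- rnorm(30, mean = 10, sd = 2) # hypothetical data
n <- length(y)

mean(y)                           # MLE for mu
var(y) * (n - 1) / n              # MLE for sigma^2
var(y)                            # sample variance s^2, for comparison
```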

Confidence interval for \(\mu\)

A 100(1-\(\alpha\))% confidence interval for \(\mu\) is \[\overline{y} \pm t_{n-1,\alpha/2}\, s/\sqrt{n}\] where \(t_{n-1,\alpha/2}\) is the \(t\)-critical value from a Student's \(T\)-distribution with \(n-1\) degrees of freedom.
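
Computed by hand in R with hypothetical data (the `t.test()` call shown in the next section reports the same interval):

```r
set.seed(20180101)
y <- rnorm(30, mean = 10, sd = 2)              # hypothetical data
n <- length(y); alpha <- 0.05

t_crit <- qt(1 - alpha / 2, df = n - 1)        # t-critical value
mean(y) + c(-1, 1) * t_crit * sd(y) / sqrt(n)  # 100(1-alpha)% CI for mu
```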

Hypothesis test

A hypothesis test with null hypothesis \(\mu=\mu_0\) calculates the \(t\)-statistic \[t = \frac{\overline{y}-\mu_0}{s/\sqrt{n}}.\] A \(p\)-value is calculated using \[p\mbox{-value} = 2P(T_{n-1} < -|t|)\] where \(T_{n-1}\) is a Student's \(T\)-distribution with \(n-1\) degrees of freedom.

If the \(p\)-value is less than our significance level \(\alpha\), we reject the null model. Otherwise, we fail to reject the null model. All the same caveats about hypothesis tests mentioned above apply.
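
In R, `t.test()` performs this test; a sketch with hypothetical data and a hypothetical null value \(\mu_0 = 10\):

```r
set.seed(20180101)
y <- rnorm(30, mean = 10, sd = 2) # hypothetical data
t.test(y, mu = 10)                # reports t, the p-value, and the CI for mu
```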

Read the damn ASA Statement on \(p\)-Values!