R code


Probability theory is the branch of mathematics concerned with the analysis of random phenomenon.

The mathematical definition of probability typically starts with Kolmogorov’s axioms of probability. As a reminder, axioms are assumed to be true (i.e. they can’t be proven).

  1. The probability of an event is a non-negative real number: \[P(E) \in \mathbb{R}, P(E) \ge 0 \quad\forall E \in F\] where \(F\) is some event space.
  2. The probability that at least one elementary event in the sample space will occur is 1, i.e. \[P(\Omega)=1\] where \(\Omega\) is the sample space.
  3. Any countable sequence of disjoint events \(E_1,E_2,\ldots\) satisfies \[P\left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i).\]


Two events are independent if their joint probability is the product of their marginal probabilities, i.e.  \[P(A,B) = P(A)P(B).\]

Conditional probability

The conditional probability of one event given another event is \[P(A|B) = \frac{P(A,B)}{P(B)} \qquad P(B) > 0.\]

Equally likely outcomes

Typically, we start out teaching probability using intuitive approaches like rolling dice or drawing cards out of a deck. These examples all rely on our understanding that the elementary outcomes in the sample space are equally likely, e.g. each side of a die is equally likely to be rolled. From this belief, we can calculate probabilities through the formula: \[P(A) = \frac{|A|}{|\Omega|}\] where \(|cdot|\) is the cardinality or size of the set.

For complicated sets, we rely on the Fundamental Counting Principle which states “if there are \(n\) ways of doing a first thing and \(m\) ways of doing a second thing, then there are \(n\times m\) ways of doing both things.”


One application is a combination, i.e. how many ways are there to pull \(k\) items out of a set of \(n\) items? The number of combinations are \[C(n,k) = {n\choose k} = \frac{n!}{(n-k)!k!}.\]


Another application is a permutation, i.e. how many ways are there to arrange \(k\) items out of \(n\)? The number of permutations are \[P(n,k) = n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!}.\] A common question is when \(n=k\), i.e. you are rearranging all \(n\) items. In this case, there are \(n!\) permutations.

Calculating probabilities

A common application of these ideas is to calculate probabilities for the sum of two 6-sided dice.

d <- data.frame(sum = 2:12, probability = c(1:6,5:1)/36)
plot(probability ~ sum, data = d)

Another application is to determine probabilities of obtaining a particular set of cards drawn from a deck. For example, the probability of obtaining a pair, two cards of the same rank in a standard deck of 52 cards with 4 suits. The easiest way to understand this probability is to realize that the first card doesn’t matter, but the second card delt must match the rank of the first card. Since there are 3 cards left that match the rank and there are 51 cards left in the deck, the probability is

## [1] 0.05882353

While situations with equally likely outcomes are interesting, their primary application is to games of chance and not much else in the scientific world.

Thus to extend the applicability of probability, we introduce the idea of a random variable. The basic definition of a random variable is that it is a function of the outcome of an experiment to the real numbers, i.e. \[X: \Omega \to \mathbb{R}.\] If the image of \(X\) is countable, then \(X\) is a discrete random variable. Otherwise, \(X\) is a continuous random variable.


Discrete random variables have a probability mass function that provides the probability for each possible value of the random variable, i.e. \(f(x)= P(X=x)\), and a cumulative distribution function that provides the probability the random variable is less than or equal to that value, i.e. \(F(x) = P(X \le x)\).

Discrete random variables can further be classified by whether the image is finite or countably infinite. This is important for the application of statistical methods but less important for the mathematics of discrete random variables.

Independent discrete random variables have the joint probability mass function equal to the marginal probability mass functions. For example, if \(X\) and \(Y\) are two independent, discrete random variables then \[p_{X,Y}(x,y) = P(X=x,Y=y) = P(X=x)P(Y=y) = p_X(x)p_Y(y).\]


The binomial distribution is commonly used when count the number of success out of some number of attempts where each attempt has the same probability and the attempts are independent. Common examples are flipping coins and rolling dice.

Let \(Y\sim Bin(n,p)\) indicate a binomial random variable with \(n>0\) attempts and probability of success \(0<p<1\). The probability mass function is \[p(y) = {n\choose y}p^y(1-p)^{n-y}, \quad y=0,1,2,\ldots,n.\] If \(n=1\), then this is also referred to as a Bernoulli random variable.

The expected value of a binomial random variable is \(E[Y] = np\) and the variance is \(Var[Y] = np(1-p)\).


The Poisson distribution is commonly used when our data are counts, but there is clear or obvious maximum possible count. Typically these counts are over some amount of time, space, or space-time. For example, - the number of cars passing through an intersection in an hour, - the number of blades of grass in a square meter, or - the number of clicks on a website in a minute.

Let \(Y\sim Po(r)\) indicate a Poisson random variable with rate \(r>0\). The probability mass function is \[p(y) = \frac{e^{-r} r^y}{y!}, \quad y = 0,1,2,\ldots\] The expected value is \(E[Y] = r\) and the variance is \(Var[Y] = r\).


Continuous random variables have an image that is uncountably infinite and thus the probability of every value of the random variable is 0. To calculate the probability of the random variable falling into an interval \((a,b)\), we integrate (or find the area under) the probability density function between \(a\) and \(b\). Continuous random variables also have a cumulative distribution function that provides the probability the random variable is less than or equal to a particular value, i.e. \(F(x) = P(X \le x) = P(X < x)\).


The most important continuous random variable is the normal (or Gaussian) random variable. Let \(Y\sim N(m,C)\) be a normal random variable with mean \(m\) and variance \(C>0\). The probability density function for a normal random variable is \[f(y) = \frac{1}{\sqrt{2\pi C}} \exp\left( -\frac{1}{2C} (y-m)^2\right).\] This is the canonical bell-shaped curve.

curve(dnorm, -3, 3)

The expectation is \(E[Y] = m\) and the variance is \(Var[Y] = C\). While the variance is mathematically more intuitive, the standard deviation (square root of the variance) is more relevant because - 68% of normal random variables are within 1 standard deviation of the mean - 95% of normal random variables are within 2 standard deviations of the mean - 99.7% of normal random variables are within 3 standard deviations of the mean.

If \(m=0\) and \(C=1\), we have a standard normal random variable.

Central Limit Theorem

The normal distribution is so important because of the result of the Central Limit Theorem (CLT). The CLT states that sums and averages of independent, identically distributed random variables (with a finite variance) have an approximate normal distribution for large enough sample sizes.

Specifically, if \(X_1,X_2,\ldots,X_n\) are independent with \(E[X_i] = m\) and \(Var[X_i] = V\) for all \(i\), then \[\overline{X} \stackrel{\cdot}{\sim} N(m, V/n)\] and \[S = n\overline{X} \stackrel{\cdot}{\sim} N(nm, nV)\] for sufficiently large \(n\) where \(\stackrel{\cdot}{\sim}\) means approximately distributed.


One of the most important skills as a data analyst the abstraction of the scientific problem into a standard statistical method. Part of this abstraction involves identifying the structure of the data in terms of the type of variables involved. The type of variable involved are closely related with distributional assumptions. The two main data types are categorical and continuous.


Categorical (or qualitative) data are non-numeric. There are two main types of categorical data: ordinal and nominal.


For ordinal data, order matters. An example of ordinal data is levels of education: high school, bachelor’s, master’s, PhD.


A very common type of ordinal data that often occurs on surveys is Likert scale data, e.g. strongly disagree, disagree, neither, agree, and strongly agree.

It is extremely common to analyze these data using a numeric scale, e.g.  1 to 5 to correspond to the ordinal scale. This is a strong assumption and not required as there are statistical methodologies that can analyze the data directly on the ordinal scale.


For nominal data, order does not matter. An example of ordinal data is type of fruit: apple, banana, orange, etc.


When a categorical variable only has two levels it is binary. Some examples of categorical binary variables are

  • success/failure
  • true/false
  • above/below

Typically we convert this categorical variable into a quantitative variable by assigning one of the two categories to 0 and the other to 1. Then, we will typically model binary variables using the Bernoulli distribution.


Quantitative are numeric. The two main types of quantitative data are continuous and count.


Continuous data are numeric data that can take on any value within a range. In practice, we are often limited in the precision with which we can measure something, but it is often easier to treat the resulting data as continuous.

The two most common types of continuous variables are those that are strictly positive and those that can be any real number.


The most common type of continuous variable is positive. For example, the following are all examples of continuous variables that are strictly positive

  • length
  • area
  • mass
  • temperature
  • time/duration
  • salary

A common approach to analyzing strictly positive variables is to take a logarithm in which case you now have an unrestricted continuous variable.


Many continuous variables can be any real number and are therefore unrestricted. Although less common than positive variables, these variables often result from taking differences of positive variables or taking logarithms of a single positive variable.

We will often transform variables to this unrestricted space in order to satisfy assumptions of the normal distribution.


Another type of numeric variable is a count variable. Counts are restricted to non-negative integers, i.e. whole numbers.


Binary numeric variables are simply 0 or 1. Note that any variable can be transformed into a binary variable and therefore, in some sense, this is the universal variable. Some examples of binary variables are

Binary variables are typically modeled using the Bernoulli distribution.

Count with clear maximum

Count data that have a clear maximum occur frequently as the count of some successes in some number of attempts. For example,

  • number of free throws made in 20 attempts
  • number of houses in my neighborhood damaged by a storm out of the 65 houses
  • number of correct answers on a test out of the 99 students

If we can assume independence amongst all the attempts, then we can utilize a binomial distribution. If we cannot, then we will typically use a Bernoulli distribution with a regression structure, e.g. logistic regression.

Count with no clear maximum

Often counts have no clear or obvious maximum. For example,

  • number of free throws made in a game
  • number of students enrolled
  • number of books on my shelf

In these situations, we commonly use a Poisson distribution to model these counts. Recall that the Poisson distribution has a variance equal to the mean. If the variance is greater than the mean, we will typically use the negative binomial distribution. (This is harder than just calculating the mean and calculating the variance.)


Although in theory any data we measure can be categorized as above. Sometimes we don’t actual observe the data that could have been observed because we end the experiment early or we only record data every so often. In these situations the data become censored. The two most common types of censoring are

  • right-censored: we know the data are larger than some number
  • interval-censored: we know the data are in some range

When the censoring is large or occurs for a large number of the observations, then we need to be careful with our analyses. These types of analyses are discussed in survival or reliability analysis.