
Probability

Probability theory is the branch of mathematics concerned with the analysis of random phenomena.

The mathematical definition of probability typically starts with Kolmogorov’s axioms of probability. As a reminder, axioms are assumed to be true (i.e. they can’t be proven).

  1. The probability of an event is a non-negative real number: \[P(E) \in \mathbb{R}, P(E) \ge 0 \quad\forall E \in F\] where \(F\) is some event space.
  2. The probability that at least one elementary event in the sample space will occur is 1, i.e. \[P(\Omega)=1\] where \(\Omega\) is the sample space.
  3. Any countable sequence of disjoint events \(E_1,E_2,\ldots\) satisfies \[P\left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i).\]
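
We can check these axioms numerically for a fair six-sided die:

# Checking the axioms for a fair six-sided die
p <- rep(1/6, 6)   # P({i}) for i = 1,...,6
all(p >= 0)        # Axiom 1: probabilities are non-negative
## [1] TRUE
sum(p)             # Axiom 2: P(Omega) = 1
## [1] 1
sum(p[c(1,3,5)])   # Axiom 3: P(odd) = P({1}) + P({3}) + P({5})
## [1] 0.5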

Independence

Two events are independent if their joint probability is the product of their marginal probabilities, i.e.  \[P(A,B) = P(A)P(B).\]
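
For example, with two fair dice, the events "the first die is even" and "the second die is a 6" are independent, which we can verify by enumerating the 36 equally likely outcomes:

# Verifying independence by enumeration
grid <- expand.grid(d1 = 1:6, d2 = 1:6)          # all 36 outcomes
pA   <- mean(grid$d1 %% 2 == 0)                  # P(A): first die is even
pB   <- mean(grid$d2 == 6)                       # P(B): second die is a 6
pAB  <- mean(grid$d1 %% 2 == 0 & grid$d2 == 6)   # P(A,B)
c(pAB, pA * pB)                                  # equal, so A and B are independent
## [1] 0.08333333 0.08333333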

Conditional probability

The conditional probability of one event given another event is \[P(A|B) = \frac{P(A,B)}{P(B)} \qquad P(B) > 0.\]
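
For example, with two fair dice, the probability the sum is 8 given the first die is a 6:

# Conditional probability by enumeration: P(sum = 8 | first die is 6)
grid <- expand.grid(d1 = 1:6, d2 = 1:6)
pB   <- mean(grid$d1 == 6)                           # P(B)
pAB  <- mean(grid$d1 + grid$d2 == 8 & grid$d1 == 6)  # P(A,B)
pAB / pB
## [1] 0.1666667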

Equally likely outcomes

Typically, we start out teaching probability using intuitive approaches like rolling dice or drawing cards out of a deck. These examples all rely on our understanding that the elementary outcomes in the sample space are equally likely, e.g. each side of a die is equally likely to be rolled. From this belief, we can calculate probabilities through the formula: \[P(A) = \frac{|A|}{|\Omega|}\] where \(|\cdot|\) is the cardinality or size of the set.
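
For example, the probability of rolling an even number on a fair six-sided die:

# P(even roll) = |A| / |Omega| for equally likely outcomes
Omega <- 1:6
A     <- c(2, 4, 6)
length(A) / length(Omega)
## [1] 0.5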

For complicated sets, we rely on the Fundamental Counting Principle which states “if there are \(n\) ways of doing a first thing and \(m\) ways of doing a second thing, then there are \(n\times m\) ways of doing both things.”
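
For example, rolling two six-sided dice gives \(6 \times 6 = 36\) possible outcomes:

# Fundamental Counting Principle: 6 ways for die 1 times 6 ways for die 2
nrow(expand.grid(die1 = 1:6, die2 = 1:6))
## [1] 36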

Combinations

One application is a combination, i.e. how many ways are there to pull \(k\) items out of a set of \(n\) items? The number of combinations is \[C(n,k) = {n\choose k} = \frac{n!}{(n-k)!k!}.\]

# Example combination: C(10,4)
choose(10, 4) 
## [1] 210

Permutations

Another application is a permutation, i.e. how many ways are there to arrange \(k\) items out of \(n\)? The number of permutations is \[P(n,k) = n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!}.\] A common question is when \(n=k\), i.e. you are rearranging all \(n\) items. In this case, there are \(n!\) permutations.

# Example permutation: P(10,4)
factorial(10) / factorial(10 - 4)
## [1] 5040

Calculating probabilities

A common application of these ideas is to calculate probabilities for the sum of two 6-sided dice.

# Probabilities for the sum of two 6-sided dice
d <- data.frame(sum = 2:12, probability = c(1:6,5:1)/36)
plot(probability ~ sum, data = d)
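
We can verify these probabilities by enumerating the 36 equally likely outcomes:

# Verify the probabilities by enumerating the 36 equally likely outcomes
grid <- expand.grid(d1 = 1:6, d2 = 1:6)
all.equal(as.numeric(table(grid$d1 + grid$d2) / 36), d$probability)
## [1] TRUE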

Another application is to determine probabilities of obtaining a particular set of cards drawn from a deck. For example, consider the probability of obtaining a pair: two cards of the same rank from a standard deck of 52 cards with 4 suits. The easiest way to understand this probability is to realize that the first card doesn't matter, but the second card dealt must match the rank of the first card. Since there are 3 cards left that match the rank and there are 51 cards left in the deck, the probability is

# Probability of a pair when being dealt two cards
3/51
## [1] 0.05882353
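
We can check this answer by counting: there are 13 ranks, \({4 \choose 2}\) ways to choose 2 suits within a rank, and \({52 \choose 2}\) possible two-card hands.

# Check by counting pairs among all two-card hands
13 * choose(4, 2) / choose(52, 2)
## [1] 0.05882353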

While situations with equally likely outcomes are interesting, their primary application is to games of chance; they have little application elsewhere in the scientific world.

Thus to extend the applicability of probability, we introduce the idea of a random variable. The basic definition of a random variable is that it is a function from the outcomes of an experiment to the real numbers, i.e. \[X: \Omega \to \mathbb{R}.\] If the image of \(X\) is finite or countably infinite, then \(X\) is a discrete random variable. Otherwise, \(X\) is a continuous random variable.
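
For example, the sum of two dice is a random variable: it maps each of the 36 outcomes in \(\Omega\) to a real number.

# A random variable as a function of outcomes: X = sum of two dice
grid <- expand.grid(d1 = 1:6, d2 = 1:6)   # the sample space Omega
X <- grid$d1 + grid$d2                    # X: Omega -> R
head(X)
## [1] 2 3 4 5 6 7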

Discrete

Discrete random variables have a probability mass function that provides the probability for each possible value of the random variable, i.e. \(f(x)= P(X=x)\), and a cumulative distribution function that provides the probability the random variable is less than or equal to that value, i.e. \(F(x) = P(X \le x)\).
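
For the sum of two dice, the probability mass function and cumulative distribution function can be tabulated directly:

# PMF and CDF for the sum of two 6-sided dice
d <- data.frame(sum = 2:12, probability = c(1:6, 5:1)/36)
d$cdf <- cumsum(d$probability)   # F(x) = P(X <= x)
d[d$sum == 7, ]                  # P(X = 7) and P(X <= 7)
##   sum probability       cdf
## 6   7   0.1666667 0.5833333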

Discrete random variables can further be classified by whether the image is finite or countably infinite. This is important for the application of statistical methods but less important for the mathematics of discrete random variables.

Independent discrete random variables have a joint probability mass function equal to the product of the marginal probability mass functions. For example, if \(X\) and \(Y\) are two independent, discrete random variables then \[p_{X,Y}(x,y) = P(X=x,Y=y) = P(X=x)P(Y=y) = p_X(x)p_Y(y).\]
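
For two independent fair dice, the joint probability mass function is the outer product of the marginal probability mass functions:

# Joint PMF of two independent dice via the product of marginals
pX <- rep(1/6, 6)
pY <- rep(1/6, 6)
joint <- outer(pX, pY)   # joint[x,y] = pX[x] * pY[y]
sum(joint)               # a valid joint PMF sums to 1
## [1] 1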

Binomial

The binomial distribution is commonly used when counting the number of successes out of some number of attempts, where each attempt has the same probability of success and the attempts are independent. Common examples are flipping coins and rolling dice.

Let \(Y\sim Bin(n,p)\) indicate a binomial random variable with \(n>0\) attempts and probability of success \(0<p<1\). The probability mass function is \[p(y) = {n\choose y}p^y(1-p)^{n-y}, \quad y=0,1,2,\ldots,n.\] If \(n=1\), then this is also referred to as a Bernoulli random variable.

The expected value (or mean) of a binomial random variable is \(E[Y] = np\) and the variance is \(Var[Y] = np(1-p)\).

Example

Suppose \(Y \sim Bin(25, 0.9)\), i.e. \(Y\) is a binomial random variable with \(25\) attempts and probability of success \(0.9\).

#############################################################################
# Binomial
#############################################################################


# Binomial parameters
n <- 25
p <- 0.9

# Expected value (mean)
n*p
## [1] 22.5
# Variance
n*p*(1-p)
## [1] 2.25

What is the probability that \(Y\) is equal to \(24\)?

# Calculate probability using probability mass function
dbinom(24, size = n, prob = p)
## [1] 0.1994161

What is the probability that \(Y\) is less than \(23\)?

# Calculate probability using the cumulative distribution function
# Remember P(Y < y) = P(Y <= y-1)
pbinom(23 - 1, size = n, prob = p)
## [1] 0.4629059
# Alternatively using the probability mass function
sum(dbinom(0:22, size = n, prob = p))
## [1] 0.4629059

Poisson

The Poisson distribution is commonly used when our data are counts but there is no clear or obvious maximum possible count. Typically these counts are over some amount of time, space, or space-time. For example,

  • the number of cars passing through an intersection in an hour,
  • the number of blades of grass in a square meter, or
  • the number of clicks on a website in a minute.

Let \(Y\sim Po(\lambda)\) indicate a Poisson random variable with rate \(\lambda>0\). The probability mass function is \[p(y) = \frac{e^{-\lambda} \lambda^y}{y!}, \quad y = 0,1,2,\ldots\] The expected value (or mean) is \(E[Y] = \lambda\) and the variance is \(Var[Y] = \lambda\).

Example

Suppose \(Y \sim Po(5.4)\), i.e. \(Y\) is a Poisson random variable with rate \(5.4\).

#############################################################################
# Poisson
#############################################################################

# Poisson parameter
rate <- 5.4

# Mean
rate
## [1] 5.4
# Variance
rate
## [1] 5.4

What is the probability \(Y\) is \(4\)?

# Calculate Poisson probability using probability mass function
dpois(4, lambda = rate)
## [1] 0.1600198
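
We can confirm this value using the probability mass function formula directly:

# Check against the PMF formula: exp(-lambda) * lambda^y / y!
exp(-rate) * rate^4 / factorial(4)
## [1] 0.1600198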

What is the probability \(Y\) is above 3 and below 9?

Note that \(P(3 < Y < 9) = P(Y \le 8) - P(Y \le 3)\). This allows us to use the cumulative distribution function.

# Calculate probability of a range using the cumulative distribution function
ppois(8, lambda = rate) - ppois(3, lambda = rate) # OR
## [1] 0.6893592
diff(ppois(c(3,8), lambda = rate))
## [1] 0.6893592

Also note that \(P(3 < Y < 9) = P(Y = 4) + P(Y = 5) + P(Y = 6) + P(Y = 7) + P(Y = 8)\) due to Kolmogorov’s third axiom.

# Calculate probability of a range using the sum of probability mass function values
dpois(4, lambda = rate) + 
  dpois(5, lambda = rate) +
  dpois(6, lambda = rate) +
  dpois(7, lambda = rate) +
  dpois(8, lambda = rate)   # OR
## [1] 0.6893592
sum(dpois(4:8, lambda = rate))
## [1] 0.6893592

Continuous

Continuous random variables have an image that is uncountably infinite and thus the probability of every value of the random variable is 0. To calculate the probability of the random variable falling into an interval \((a,b)\), we integrate (or find the area under) the probability density function between \(a\) and \(b\). Continuous random variables also have a cumulative distribution function that provides the probability the random variable is less than or equal to a particular value, i.e. \(F(x) = P(X \le x) = P(X < x)\).

Normal

The most important continuous random variable is the normal (or Gaussian) random variable. Let \(Y\sim N(\mu,\sigma^2)\) be a normal random variable with mean \(\mu\) and variance \(\sigma^2>0\). The probability density function (PDF) for a normal random variable is \[f(y) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y-\mu)^2\right).\] This is the canonical bell-shaped curve. If \(\mu=0\) and \(\sigma^2=1\), we have a standard normal random variable.

#############################################################################
# Normal
#############################################################################

# Standard normal density, Y ~ N(0,1)
mu    <- 0
sigma <- 1 # Standard deviation

# Probability density function (PDF)
curve(dnorm(x, mean = mu, sd = sigma), from = mu-3*sigma, to = mu+3*sigma)

The expectation is \(E[Y] = \mu\) and the variance is \(Var[Y] = \sigma^2\). While the variance is mathematically more convenient, the standard deviation (\(\sigma\)) is more intuitive because

  • 68% of normal random variables are within 1 standard deviation of the mean
  • 95% of normal random variables are within 2 standard deviations of the mean
  • 99.7% of normal random variables are within 3 standard deviations of the mean.
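
These percentages are easy to verify with the standard normal CDF:

# Verifying the 68-95-99.7 rule with the standard normal CDF
diff(pnorm(c(-1, 1)))   # within 1 standard deviation
## [1] 0.6826895
diff(pnorm(c(-2, 2)))   # within 2 standard deviations
## [1] 0.9544997
diff(pnorm(c(-3, 3)))   # within 3 standard deviations
## [1] 0.9973002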

The cumulative distribution function (CDF) is the probability of being less than (or equal to) some value. For a continuous random variable, this is the integral from negative infinity up to the value.

\[P(Y \le y) = P(Y < y) = \int_{-\infty}^y f(x) dx.\]

# P(Y < 0.5)
xmax <- 0.5

# Using CDF
pnorm(xmax, mean = mu, sd = sigma) # OR
## [1] 0.6914625
integrate(dnorm, -Inf, xmax, mean = mu, sd = sigma)
## 0.6914625 with absolute error < 1.6e-09

This can be visualized as the area under the PDF up to \(y\). For example, here is the area under the standard normal PDF up to \(0.5\).

# Visualizing the area under the PDF
x <- seq(from = mu - 3*sigma, to = mu + 3*sigma, length = 1001)
y <- dnorm(x, mean = mu, sd = sigma)

plot(x, y, type = "l")
polygon(c( x[x<=xmax], xmax, min(x)), c(y[x<=xmax], 0, 0), col="red")

Example

Suppose \(Y \sim N(3, 4^2)\), i.e. \(Y\) is a normal random variable with mean \(3\) and variance \(4^2 = 16\) (or, equivalently, standard deviation \(4\)).

# Normal parameters
mu    <- 3
sigma <- 4

# Mean
mu
## [1] 3
# Variance
sigma^2
## [1] 16
# Standard deviation
sigma
## [1] 4

What is the probability that \(Y = 2\)? This is a trick question: the answer is 0, as it is for any single value of any continuous random variable.
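
Using the cumulative distribution function, the "interval" from 2 to 2 has zero width and therefore zero probability:

# P(Y = 2) = P(2 <= Y <= 2) = F(2) - F(2) = 0
pnorm(2, mean = mu, sd = sigma) - pnorm(2, mean = mu, sd = sigma)
## [1] 0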

What is the value of the probability density function for \(Y=2\)?

# What is the probability density function for Y=2?
dnorm(2, mean = mu, sd = sigma)
## [1] 0.09666703

What is the probability this random variable is between -0.5 and 2?

# Probability of a range using CDF
pnorm(2, mean = mu, sd = sigma) - pnorm(-0.5, mean = mu, sd = sigma)
## [1] 0.2105067
diff(pnorm(c(-0.5, 2), mean = mu, sd = sigma))
## [1] 0.2105067
# Probability of a range using PDF
integrate(dnorm, lower = -0.5, upper = 2, mean = mu, sd = sigma)
## 0.2105067 with absolute error < 2.3e-15

Central Limit Theorem

The normal distribution is so important because of the result of the Central Limit Theorem (CLT). The CLT states that sums and averages of independent, identically distributed random variables (with a finite variance) have an approximate normal distribution for large enough sample sizes.

Specifically, if \(X_1,X_2,\ldots,X_n\) are independent with \(E[X_i] = \mu\) and \(Var[X_i] = \sigma^2\) for all \(i\), then \[\overline{X} \stackrel{\cdot}{\sim} N(\mu, \sigma^2/n)\] and \[S = n\overline{X} \stackrel{\cdot}{\sim} N(n\mu, n\sigma^2)\] for sufficiently large \(n\) where \(\stackrel{\cdot}{\sim}\) means approximately distributed.
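
A small simulation illustrates the CLT. Here we average \(n = 30\) independent exponential random variables (mean 1, variance 1), so the CLT says \(\overline{X} \stackrel{\cdot}{\sim} N(1, 1/30)\); the seed below is arbitrary.

# Simulating the CLT: means of 30 independent exponential(1) random variables
set.seed(1)   # arbitrary seed for reproducibility
n     <- 30
means <- replicate(1e4, mean(rexp(n, rate = 1)))

hist(means, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 1, sd = 1/sqrt(n)), add = TRUE, col = "red")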

Data

One of the most important skills as a data analyst is the abstraction of a scientific problem into a standard statistical method. Part of this abstraction involves identifying the structure of the data in terms of the types of variables involved. The types of variables are closely related to distributional assumptions. The two main data types are categorical and quantitative.

Categorical

Categorical (or qualitative) data are non-numeric. There are two main types of categorical data: ordinal and nominal.

Ordinal

For ordinal data, order matters. An example of ordinal data is levels of education: high school, bachelor’s, master’s, PhD.

Likert

A very common type of ordinal data that often occurs on surveys is Likert scale data, e.g. strongly disagree, disagree, neither, agree, and strongly agree.

It is extremely common to analyze these data using a numeric scale, e.g. 1 to 5 corresponding to the ordinal scale. This is a strong assumption and is not required, as there are statistical methodologies that can analyze the data directly on the ordinal scale.
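
A minimal sketch of this numeric coding, using hypothetical responses:

# Coding Likert responses as 1-5 (a strong assumption)
responses <- factor(c("agree", "neither", "strongly agree", "disagree"),
                    levels = c("strongly disagree", "disagree", "neither",
                               "agree", "strongly agree"),
                    ordered = TRUE)
as.integer(responses)   # integer codes follow the ordered levels
## [1] 4 3 5 2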

Nominal

For nominal data, order does not matter. An example of nominal data is type of fruit: apple, banana, orange, etc.

Binary

When a categorical variable has only two levels, it is binary. Some examples of categorical binary variables are

  • success/failure
  • true/false
  • above/below

Typically we convert this categorical variable into a quantitative variable by assigning one of the two categories to 0 and the other to 1. We then model binary variables using the Bernoulli distribution.
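
For example, converting hypothetical success/failure data to 0/1:

# Converting a categorical binary variable to 0/1
outcome <- c("success", "failure", "success", "success")
as.integer(outcome == "success")
## [1] 1 0 1 1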

Quantitative

Quantitative data are numeric. The two most common types of quantitative data are continuous and count.

Continuous

Continuous data are numeric data that can take on any value within a range. In practice, we are often limited in the precision with which we can measure something, but it is often easier to treat the resulting data as continuous.

The two most common types of continuous variables are those that are strictly positive and those that can be any real number.

Positive

The most common type of continuous variable is positive. For example, the following are all examples of continuous variables that are strictly positive

  • length
  • area
  • mass
  • temperature (on an absolute scale, e.g. kelvins)
  • time/duration
  • salary

A common approach to analyzing strictly positive variables is to take a logarithm in which case you now have an unrestricted continuous variable.
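
For example, using hypothetical salaries:

# Log-transforming a strictly positive variable
salary <- c(45000, 52000, 61000, 120000)
log(salary)   # the log scale is unrestricted
## [1] 10.71442 10.85900 11.01863 11.69525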

Unrestricted

Many continuous variables can be any real number and are therefore unrestricted. Although less common than positive variables, these variables often result from taking differences of positive variables or taking logarithms of a single positive variable.

We will often transform variables to this unrestricted space in order to satisfy assumptions of the normal distribution.

Count

Another type of numeric variable is a count variable. Counts are restricted to non-negative integers, i.e. whole numbers.

Binary

Binary numeric variables are simply 0 or 1. Note that any variable can be transformed into a binary variable and therefore, in some sense, this is the universal variable. Examples are the 0/1 encodings of the categorical binary variables above, e.g. failure/success coded as 0/1.

Binary variables are typically modeled using the Bernoulli distribution.

Count with clear maximum

Count data that have a clear maximum occur frequently as the count of some successes in some number of attempts. For example,

  • number of free throws made in 20 attempts
  • number of houses in my neighborhood damaged by a storm out of the 65 houses
  • number of correct answers out of the 99 questions on a test

If we can assume independence amongst all the attempts, then we can utilize a binomial distribution. If we cannot, then we will typically use a Bernoulli distribution with a regression structure, e.g. logistic regression.

Count with no clear maximum

Often counts have no clear or obvious maximum. For example,

  • number of free throws made in a game
  • number of students enrolled
  • number of books on my shelf

In these situations, we commonly use a Poisson distribution to model these counts. Recall that the Poisson distribution has a variance equal to its mean. If the variance is greater than the mean, we will typically use the negative binomial distribution. (Determining whether the variance truly exceeds the mean is harder than simply comparing the sample mean and sample variance.)
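
For example, with hypothetical counts, comparing the sample mean and variance gives a rough indication of overdispersion:

# Comparing sample mean and variance for hypothetical count data
counts <- c(0, 2, 1, 5, 3, 8, 0, 1, 12, 4)
mean(counts)
## [1] 3.6
var(counts)    # much larger than the mean, suggesting the negative binomial
## [1] 14.93333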

Censored

Although in theory any data we measure can be categorized as above, sometimes we don't actually observe the data that could have been observed because we end the experiment early or we only record data every so often. In these situations, the data become censored. The two most common types of censoring are

  • right-censored: we know the data are larger than some number
  • interval-censored: we know the data are in some range

When the censoring is severe or occurs for a large number of observations, we need to be careful with our analyses. These types of analyses are discussed in survival or reliability analysis.
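
As a minimal sketch, right-censored data can be represented with the survival package, where a "+" marks observations known only to exceed the recorded value (the times and statuses below are hypothetical):

# Representing right-censored data (hypothetical follow-up times)
library(survival)
time   <- c(5, 8, 12, 12, 20)   # follow-up times
status <- c(1, 1, 0, 1, 0)      # 1 = event observed, 0 = right-censored
Surv(time, status)
## [1]  5   8  12+ 12  20+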