*Probability theory* is the branch of mathematics concerned
with the analysis of random phenomena.

The mathematical definition of *probability* typically starts
with Kolmogorov’s
axioms of probability. As a reminder, axioms are assumed to be true
(i.e. they can’t be proven).

- The probability of an event is a non-negative real number: \[P(E) \in \mathbb{R}, P(E) \ge 0 \quad\forall E \in F\] where \(F\) is some event space.
- The probability that at least one elementary event in the sample space will occur is 1, i.e. \[P(\Omega)=1\] where \(\Omega\) is the sample space.
- Any countable sequence of disjoint events \(E_1,E_2,\ldots\) satisfies \[P\left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i).\]

Two events are independent if their joint probability is the product of their marginal probabilities, i.e. \[P(A,B) = P(A)P(B).\]

The conditional probability of one event given another event is \[P(A|B) = \frac{P(A,B)}{P(B)} \qquad P(B) > 0.\]
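These two definitions can be checked with a small constructed example. The sketch below (a hypothetical two-dice setup, not from the text) enumerates all 36 equally likely outcomes in base R:

```
# Conditional probability and independence with two fair dice
# (hypothetical example: all 36 outcomes are equally likely)
omega <- expand.grid(die1 = 1:6, die2 = 1:6)

A <- omega$die1 == 1               # event: first die shows 1
B <- omega$die1 + omega$die2 == 7  # event: sum is 7

P <- function(event) mean(event)   # each outcome has probability 1/36

P(A & B)          # joint probability: 1/36
P(A) * P(B)       # product of marginals: (1/6)(1/6) = 1/36, so A and B are independent
P(A & B) / P(B)   # conditional probability P(A|B) = 1/6
```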

Typically, we start out teaching probability using intuitive
approaches like rolling dice or drawing cards out of a deck. These
examples all rely on our understanding that the elementary outcomes in
the sample space are equally likely, e.g. each side of a die is equally
likely to be rolled. From this belief, we can calculate probabilities
through the formula: \[P(A) =
\frac{|A|}{|\Omega|}\] where \(|\cdot|\) is the *cardinality* or
size of the set.
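As a quick sketch of this formula (using a hypothetical fair six-sided die):

```
# Equally likely outcomes: P(A) = |A| / |Omega|
# (hypothetical example: one roll of a fair six-sided die)
omega <- 1:6                 # sample space
A <- c(2, 4, 6)              # event: roll an even number
length(A) / length(omega)    # 3/6
```

`## [1] 0.5`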

For complicated sets, we rely on the *Fundamental Counting
Principle* which states “if there are \(n\) ways of doing a first thing and \(m\) ways of doing a second thing, then
there are \(n\times m\) ways of doing
both things.”
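A minimal sketch of the principle, assuming a hypothetical wardrobe of 3 shirts and 4 pairs of pants:

```
# Fundamental Counting Principle
# (hypothetical example: 3 shirts and 4 pairs of pants)
n <- 3  # ways to choose a shirt
m <- 4  # ways to choose pants
n * m   # 12 outfits

# Equivalent: enumerate every (shirt, pants) combination and count the rows
nrow(expand.grid(shirt = 1:n, pants = 1:m))  # also 12
```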

One application is a *combination*, i.e. how many ways are
there to pull \(k\) items out of a set
of \(n\) items? The number of
combinations is \[C(n,k) = {n\choose k} =
\frac{n!}{(n-k)!k!}.\]

```
# Example combination: C(10,4)
choose(10, 4)
```

`## [1] 210`

Another application is a *permutation*, i.e. how many ways are
there to arrange \(k\) items out of
\(n\)? The number of permutations is
\[P(n,k) = n(n-1)(n-2)\cdots(n-k+1) =
\frac{n!}{(n-k)!}.\] A common question is when \(n=k\), i.e. you are rearranging all \(n\) items. In this case, there are \(n!\) permutations.

```
# Example permutation: P(10,4)
factorial(10) / factorial(10 - 4)
```

`## [1] 5040`

A common application of these ideas is to calculate probabilities for the sum of two 6-sided dice.

```
# Probabilities for the sum of two 6-sided dice
d <- data.frame(sum = 2:12, probability = c(1:6,5:1)/36)
plot(probability ~ sum, data = d)
```
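Rather than typing the eleven probabilities in by hand, they can be derived by enumerating all 36 equally likely outcomes; the sketch below also confirms they sum to 1, as the second axiom requires:

```
# Derive the dice-sum probabilities by enumerating all 36 equally likely outcomes
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
probability <- table(rolls$die1 + rolls$die2) / 36
probability
sum(probability)  # the probabilities sum to 1
```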

Another application is to determine probabilities of obtaining a
particular set of cards drawn from a deck. For example, the probability
of obtaining a *pair*, two cards of the same rank, from a standard
deck of 52 cards with 4 suits. The easiest way to understand this
probability is to realize that the first card doesn’t matter, but the
second card dealt must match the rank of the first card. Since there are
3 cards left that match the rank and there are 51 cards left in the
deck, the probability is

```
# Probability of a pair when being dealt two cards
3/51
```

`## [1] 0.05882353`
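The same probability can be obtained with combinations: there are 13 ranks, \({4 \choose 2}\) ways to choose two suits of a given rank, and \({52 \choose 2}\) possible two-card hands. A quick check that the two arguments agree:

```
# Pair probability via combinations:
# 13 ranks, choose 2 of the 4 suits for that rank, out of all two-card hands
13 * choose(4, 2) / choose(52, 2)  # 78 / 1326
# Matches the sequential argument
3 / 51
```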

While situations with equally likely outcomes are interesting, their applications are largely limited to games of chance; most scientific problems require more general tools.

Thus to extend the applicability of probability, we introduce the
idea of a random variable. The basic definition of a *random
variable* is that it is a function from the outcomes of an experiment
to the real numbers, i.e. \[X: \Omega \to
\mathbb{R}.\] If the image of \(X\) is countable or finite, then \(X\) is a discrete random variable.
Otherwise, \(X\) is a continuous random
variable.

Discrete random variables have a *probability mass function*
that provides the probability for each possible value of the random
variable, i.e. \(f(x)= P(X=x)\), and a
*cumulative distribution function* that provides the probability
the random variable is less than or equal to that value, i.e. \(F(x) = P(X \le x)\).

Discrete random variables can further be classified by whether the image is finite or countably infinite. This is important for the application of statistical methods but less important for the mathematics of discrete random variables.

Independent discrete random variables have the joint probability mass function equal to the marginal probability mass functions. For example, if \(X\) and \(Y\) are two independent, discrete random variables then \[p_{X,Y}(x,y) = P(X=x,Y=y) = P(X=x)P(Y=y) = p_X(x)p_Y(y).\]
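This factorization can be illustrated with `outer()`, which builds the joint probability mass function from the two marginals (a constructed example with a fair coin coded 0/1 and a fair die):

```
# Joint pmf of two independent discrete random variables
# (hypothetical example: X is a fair coin coded 0/1, Y is a fair die)
p_X <- c(`0` = 0.5, `1` = 0.5)
p_Y <- rep(1/6, 6)
joint <- outer(p_X, p_Y)   # joint[i, j] = p_X(i) * p_Y(j)
joint
sum(joint)                 # the joint probabilities sum to 1
```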

The binomial distribution is commonly used when counting the number of successes out of some number of attempts, where each attempt has the same probability of success and the attempts are independent. Common examples are flipping coins and rolling dice.

Let \(Y\sim Bin(n,p)\) indicate a
binomial random variable with \(n>0\) *attempts* and
*probability of success* \(0<p<1\). The probability mass
function is \[p(y) = {n\choose
y}p^y(1-p)^{n-y}, \quad y=0,1,2,\ldots,n.\] If \(n=1\), then this is also referred to as a
*Bernoulli* random variable.

The expected value (or mean) of a binomial random variable is \(E[Y] = np\) and the variance is \(Var[Y] = np(1-p)\).

Suppose \(Y \sim Bin(25, 0.9)\), i.e. \(Y\) is a binomial random variable with \(25\) attempts and probability of success \(0.9\).

```
#############################################################################
# Binomial
#############################################################################
# Binomial parameters
n <- 25
p <- 0.9
# Expected value (mean)
n*p
```

`## [1] 22.5`

```
# Variance
n*p*(1-p)
```

`## [1] 2.25`

What is the probability that \(Y\) is equal to \(24\)?

```
# Calculate probability using probability mass function
dbinom(24, size = n, prob = p)
```

`## [1] 0.1994161`

What is the probability that \(Y\) is less than \(23\)?

```
# Calculate probability using the cumulative distribution function
# Remember P(Y < y) = P(Y <= y-1)
pbinom(23 - 1, size = n, prob = p)
```

`## [1] 0.4629059`

```
# Alternatively using the probability mass function
sum(dbinom(0:22, size = n, prob = p))
```

`## [1] 0.4629059`
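These calculations can be checked by simulation with `rbinom()`. This is a sketch; the results are approximate and depend on the random seed:

```
# Monte Carlo check of the binomial calculations above
# (approximate; results depend on the random seed)
set.seed(42)
y <- rbinom(100000, size = 25, prob = 0.9)
mean(y)        # close to the mean 22.5
var(y)         # close to the variance 2.25
mean(y == 24)  # close to dbinom(24, size = 25, prob = 0.9)
mean(y < 23)   # close to pbinom(22, size = 25, prob = 0.9)
```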

The Poisson distribution is commonly used when our data are counts but there is no clear or obvious maximum possible count. Typically these counts are over some amount of time, space, or space-time. For example,

- the number of cars passing through an intersection in an hour,
- the number of blades of grass in a square meter, or
- the number of clicks on a website in a minute.

Let \(Y\sim Po(\lambda)\) indicate a
Poisson random variable with *rate* \(\lambda>0\). The probability mass
function is \[p(y) = \frac{e^{-\lambda}
\lambda^y}{y!}, \quad y = 0,1,2,\ldots\] The expected value (or
mean) is \(E[Y] = \lambda\) and the
variance is \(Var[Y] = \lambda\).

Suppose \(Y \sim Po(5.4)\), i.e. \(Y\) is a Poisson random variable with rate \(5.4\).

```
#############################################################################
# Poisson
#############################################################################
# Poisson parameter
rate <- 5.4
# Mean
rate
```

`## [1] 5.4`

```
# Variance
rate
```

`## [1] 5.4`

What is the probability \(Y\) is \(4\)?

```
# Calculate Poisson probability using probability mass function
dpois(4, lambda = rate)
```

`## [1] 0.1600198`

What is the probability \(Y\) is above 3 and below 9?

Note that \(P(3 < Y < 9) = P(Y \le 8) - P(Y \le 3)\). This allows us to use the cumulative distribution function.

```
# Calculate probability of a range using the cumulative distribution function
ppois(8, lambda = rate) - ppois(3, lambda = rate) # OR
```

`## [1] 0.6893592`

```
diff(ppois(c(3,8), lambda = rate))
```

`## [1] 0.6893592`

Also note that \(P(3 < Y < 9) = P(Y = 4) + P(Y = 5) + P(Y = 6) + P(Y = 7) + P(Y = 8)\) due to Kolmogorov’s third axiom.

```
# Calculate probability of a range using the sum of probability mass function values
dpois(4, lambda = rate) +
dpois(5, lambda = rate) +
dpois(6, lambda = rate) +
dpois(7, lambda = rate) +
dpois(8, lambda = rate) # OR
```

`## [1] 0.6893592`

```
sum(dpois(4:8, lambda = rate))
```

`## [1] 0.6893592`

Continuous random variables have an image that is uncountably
infinite, and the probability of any particular value of the random variable
is 0. To calculate the probability of the random variable falling into
an interval \((a,b)\), we integrate (or
find the area under) the *probability density function* between
\(a\) and \(b\). Continuous random variables also have
a *cumulative distribution function* that provides the
probability the random variable is less than or equal to a particular
value, i.e. \(F(x) = P(X \le x) = P(X <
x)\).
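As a sketch of this idea with the simplest density, a Uniform(0,1) random variable: integrating its density over \((a,b)\) matches the difference of its cumulative distribution function values.

```
# P(a < Y < b) for a continuous random variable is the area under its
# probability density function (sketch using a Uniform(0,1) density)
a <- 0.2
b <- 0.7
integrate(dunif, lower = a, upper = b)$value  # 0.5
# The same probability from the cumulative distribution function
punif(b) - punif(a)
```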

The most important continuous random variable is the normal (or
Gaussian) random variable. Let \(Y\sim
N(\mu,\sigma^2)\) be a normal random variable with *mean*
\(\mu\) and *variance* \(\sigma^2>0\). The probability density
function (PDF) for a normal random variable is \[f(y) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(
-\frac{1}{2\sigma^2} (y-\mu)^2\right).\] This is the canonical
*bell-shaped curve*. If \(\mu=0\) and \(\sigma^2=1\), we have a *standard
normal* random variable.

```
#############################################################################
# Normal
#############################################################################
# Standard normal density, Y ~ N(0,1)
mu <- 0
sigma <- 1 # Standard deviation
# Probability density function (PDF)
curve(dnorm(x, mean = mu, sd = sigma), from = mu-3*sigma, to = mu+3*sigma)
```
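The cumulative distribution function `pnorm()` recovers the familiar 68-95-99.7 rule for a standard normal random variable:

```
# Probability within 1, 2, and 3 standard deviations of the mean
# (the 68-95-99.7 rule for a standard normal)
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997
```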