*Probability theory* is the branch of mathematics concerned
with the analysis of random phenomenon.

The mathematical definition of *probability* typically starts
with Kolmogorov’s
axioms of probability. As a reminder, axioms are assumed to be true
(i.e. they can’t be proven).

- The probability of an event is a non-negative real number: \[P(E) \in \mathbb{R}, P(E) \ge 0 \quad\forall E \in F\] where \(F\) is some event space.
- The probability that at least one elementary event in the sample space will occur is 1, i.e. \[P(\Omega)=1\] where \(\Omega\) is the sample space.
- Any countable sequence of disjoint events \(E_1,E_2,\ldots\) satisfies \[P\left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i).\]

Two events are independent if their joint probability is the product of their marginal probabilities, i.e. \[P(A,B) = P(A)P(B).\]

The conditional probability of one event given another event is \[P(A|B) = \frac{P(A,B)}{P(B)} \qquad P(B) > 0.\]

Typically, we start out teaching probability using intuitive
approaches like rolling dice or drawing cards out of a deck. These
examples all rely on our understanding that the elementary outcomes in
the sample space are equally likely, e.g. each side of a die is equally
likely to be rolled. From this belief, we can calculate probabilities
through the formula: \[P(A) =
\frac{|A|}{|\Omega|}\] where \(|cdot|\) is the *cardinality* or
size of the set.

For complicated sets, we rely on the *Fundamental Counting
Principle* which states “if there are \(n\) ways of doing a first thing and \(m\) ways of doing a second thing, then
there are \(n\times m\) ways of doing
both things.”

One application is a *combination*, i.e. how many ways are
there to pull \(k\) items out of a set
of \(n\) items? The number of
combinations are \[C(n,k) = {n\choose k} =
\frac{n!}{(n-k)!k!}.\]

Another application is a *permutation*, i.e. how many ways are
there to arrange \(k\) items out of
\(n\)? The number of permutations are
\[P(n,k) = n(n-1)(n-2)\cdots(n-k+1) =
\frac{n!}{(n-k)!}.\] A common question is when \(n=k\), i.e. you are rearranging all \(n\) items. In this case, there are \(n!\) permutations.

A common application of these ideas is to calculate probabilities for the sum of two 6-sided dice.

```
d <- data.frame(sum = 2:12, probability = c(1:6,5:1)/36)
plot(probability ~ sum, data = d)
```

Another application is to determine probabilities of obtaining a
particular set of cards drawn from a deck. For example, the probability
of obtaining a *pair*, two cards of the same rank in a standard
deck of 52 cards with 4 suits. The easiest way to understand this
probability is to realize that the first card doesn’t matter, but the
second card delt must match the rank of the first card. Since there are
3 cards left that match the rank and there are 51 cards left in the
deck, the probability is

`3/51`

`## [1] 0.05882353`

While situations with equally likely outcomes are interesting, their primary application is to games of chance and not much else in the scientific world.

Thus to extend the applicability of probability, we introduce the
idea of a random variable. The basic definition of a *random
variable* is that it is a function of the outcome of an experiment
to the real numbers, i.e. \[X: \Omega \to
\mathbb{R}.\] If the image of \(X\) is countable, then \(X\) is a discrete random variable.
Otherwise, \(X\) is a continuous random
variable.

Discrete random variables have a *probability mass function*
that provides the probability for each possible value of the random
variable, i.e. \(f(x)= P(X=x)\), and a
*cumulative distribution function* that provides the probability
the random variable is less than or equal to that value, i.e. \(F(x) = P(X \le x)\).

Discrete random variables can further be classified by whether the image is finite or countably infinite. This is important for the application of statistical methods but less important for the mathematics of discrete random variables.

Independent discrete random variables have the joint probability mass function equal to the marginal probability mass functions. For example, if \(X\) and \(Y\) are two independent, discrete random variables then \[p_{X,Y}(x,y) = P(X=x,Y=y) = P(X=x)P(Y=y) = p_X(x)p_Y(y).\]

The binomial distribution is commonly used when count the number of success out of some number of attempts where each attempt has the same probability and the attempts are independent. Common examples are flipping coins and rolling dice.

Let \(Y\sim Bin(n,p)\) indicate a
binomial random variable with \(n>0\) *attempts* and
*probability of success* \(0<p<1\). The probability mass
function is \[p(y) = {n\choose
y}p^y(1-p)^{n-y}, \quad y=0,1,2,\ldots,n.\] If \(n=1\), then this is also referred to as a
*Bernoulli* random variable.

The expected value of a binomial random variable is \(E[Y] = np\) and the variance is \(Var[Y] = np(1-p)\).

The Poisson distribution is commonly used when our data are counts, but there is clear or obvious maximum possible count. Typically these counts are over some amount of time, space, or space-time. For example, - the number of cars passing through an intersection in an hour, - the number of blades of grass in a square meter, or - the number of clicks on a website in a minute.

Let \(Y\sim Po(r)\) indicate a
Poisson random variable with *rate* \(r>0\). The probability mass function is
\[p(y) = \frac{e^{-r} r^y}{y!}, \quad y =
0,1,2,\ldots\] The expected value is \(E[Y] = r\) and the variance is \(Var[Y] = r\).

Continuous random variables have an image that is uncountably
infinite and thus the probability of every value of the random variable
is 0. To calculate the probability of the random variable falling into
an interval \((a,b)\), we integrate (or
find the area under) the *probability density function* between
\(a\) and \(b\). Continuous random variables also have
a *cumulative distribution function* that provides the
probability the random variable is less than or equal to a particular
value, i.e. \(F(x) = P(X \le x) = P(X <
x)\).

The most important continuous random variable is the normal (or
Gaussian) random variable. Let \(Y\sim
N(m,C)\) be a normal random variable with *mean* \(m\) and *variance* \(C>0\). The probability density function
for a normal random variable is \[f(y) =
\frac{1}{\sqrt{2\pi C}} \exp\left( -\frac{1}{2C}
(y-m)^2\right).\] This is the canonical *bell-shaped
curve*.

`curve(dnorm, -3, 3)`

The expectation is \(E[Y] = m\) and the variance is \(Var[Y] = C\). While the variance is mathematically more intuitive, the standard deviation (square root of the variance) is more relevant because - 68% of normal random variables are within 1 standard deviation of the mean - 95% of normal random variables are within 2 standard deviations of the mean - 99.7% of normal random variables are within 3 standard deviations of the mean.

If \(m=0\) and \(C=1\), we have a *standard normal*
random variable.

The normal distribution is so important because of the result of the Central Limit Theorem (CLT). The CLT states that sums and averages of independent, identically distributed random variables (with a finite variance) have an approximate normal distribution for large enough sample sizes.

Specifically, if \(X_1,X_2,\ldots,X_n\) are independent with
\(E[X_i] = m\) and \(Var[X_i] = V\) for all \(i\), then \[\overline{X} \stackrel{\cdot}{\sim} N(m,
V/n)\] and \[S = n\overline{X}
\stackrel{\cdot}{\sim} N(nm, nV)\] for sufficiently large \(n\) where \(\stackrel{\cdot}{\sim}\) means
*approximately distributed*.

One of the most important skills as a data analyst the
*abstraction* of the scientific problem into a standard
statistical method. Part of this abstraction involves identifying the
structure of the data in terms of the type of variables involved. The
type of variable involved are closely related with distributional
assumptions. The two main data types are categorical and continuous.

Categorical (or qualitative) data are non-numeric. There are two main types of categorical data: ordinal and nominal.

For ordinal data, order matters. An example of ordinal data is levels of education: high school, bachelor’s, master’s, PhD.

A very common type of ordinal data that often occurs on surveys is Likert scale data, e.g. strongly disagree, disagree, neither, agree, and strongly agree.

It is extremely common to analyze these data using a numeric scale, e.g. 1 to 5 to correspond to the ordinal scale. This is a strong assumption and not required as there are statistical methodologies that can analyze the data directly on the ordinal scale.

For nominal data, order does not matter. An example of ordinal data is type of fruit: apple, banana, orange, etc.

When a categorical variable only has two levels it is
*binary*. Some examples of categorical binary variables are

- success/failure
- true/false
- above/below

Typically we convert this categorical variable into a quantitative variable by assigning one of the two categories to 0 and the other to 1. Then, we will typically model binary variables using the Bernoulli distribution.

Quantitative are numeric. The two main types of quantitative data are continuous and count.

Continuous data are numeric data that can take on any value within a range. In practice, we are often limited in the precision with which we can measure something, but it is often easier to treat the resulting data as continuous.

The two most common types of continuous variables are those that are strictly positive and those that can be any real number.

The most common type of continuous variable is positive. For example, the following are all examples of continuous variables that are strictly positive

- length
- area
- mass
- temperature
- time/duration
- salary

A common approach to analyzing strictly positive variables is to take a logarithm in which case you now have an unrestricted continuous variable.

Many continuous variables can be any real number and are therefore unrestricted. Although less common than positive variables, these variables often result from taking differences of positive variables or taking logarithms of a single positive variable.

We will often transform variables to this unrestricted space in order to satisfy assumptions of the normal distribution.

Another type of numeric variable is a count variable. Counts are restricted to non-negative integers, i.e. whole numbers.

Binary numeric variables are simply 0 or 1. Note that any variable
can be transformed into a binary variable and therefore, in some sense,
this is the *universal* variable. Some examples of binary
variables are

Binary variables are typically modeled using the *Bernoulli*
distribution.

Count data that have a clear maximum occur frequently as the count of some successes in some number of attempts. For example,

- number of free throws made in 20 attempts
- number of houses in my neighborhood damaged by a storm out of the 65 houses
- number of correct answers on a test out of the 99 students

If we can assume independence amongst all the attempts, then we can
utilize a *binomial* distribution. If we cannot, then we will
typically use a *Bernoulli* distribution with a regression
structure, e.g. logistic regression.

Often counts have no clear or obvious maximum. For example,

- number of free throws made in a game
- number of students enrolled
- number of books on my shelf

In these situations, we commonly use a *Poisson* distribution
to model these counts. Recall that the Poisson distribution has a
variance equal to the mean. If the variance is greater than the mean, we
will typically use the negative binomial distribution. (This is harder
than just calculating the mean and calculating the variance.)

Although in theory any data we measure can be categorized as above.
Sometimes we don’t actual observe the data that could have been observed
because we end the experiment early or we only record data every so
often. In these situations the data become *censored*. The two
most common types of censoring are

- right-censored: we know the data are larger than some number
- interval-censored: we know the data are in some range

When the censoring is large or occurs for a large number of the
observations, then we need to be careful with our analyses. These types
of analyses are discussed in *survival* or *reliability*
analysis.