To follow along, use the lab03 code.
Many of the standard probability distributions have functions in R to calculate:
The letter after each indicates the letter the function will start with for all distributions.
If the support of the random variable is a finite or countably infinite number of values, then the random variable is discrete. Discrete random variables have a probability mass function (pmf). This pmf gives the probability that a random variable will take on each value in its support. The cumulative distribution function (cdf) provides the probability the random variable is less than or equal to a particular value. The quantile function is the inverse of the cumulative distribution function, i.e. you provide a probability and the quantile function returns the value of the random variable such that the cdf will return that probability. (This is not very useful for discrete distributions.)
Whenever you repeat an experiment and you assume
Then, if you record the number of success out of the total, you have a binomial distribution.
For example, consider an experiment with probability of success of 0.7 and 13 trials, i.e. \(X\sim Bin(13,0.7)\).
We can use the pmf to calculate the probability of a particular outcome of the experiment. For example, what is the probability of seeing 6 successes? We can use the dbinom
function.
n <- 13
p <- 0.7
dbinom(6, size = n, prob = p)
## [1] 0.0441524
The entire pmf is
x <- 0:n
plot(x, dbinom(x, size = n, prob = p), main = "Probability mass function for Bin(13,0.7)")
If we want to calculate the probability of observing an outcome less than or equal to a particular value, we can use the cumulative distribution function. For example, what is the probability of observing 9 or fewer successes?
pbinom(9, size = n, prob = p)
## [1] 0.5793944
Here is the entire cdf.
plot(x, pbinom(x, size = n, prob = p), type="s", main = "Cumulative distribution function for Bin(13,0.7)")
For a discrete random variable, the cdf is a step function since the function jumps whenever it comes across a value in the support for the random variable.
Finally, we can draw random values from this binomial distribution using the rbinom
function.
draws <- rbinom(100, size = n, prob = p) # draw 100
brks <- (0:(n+1)) - 0.5
hist(draws, breaks = brks, main = "Random draws from Bin(13,0.7)")
Rather than having the number of draws, we often want the percentage of draws. For reference, we can add the true probabilities using the pmf (shown in red).
hist(draws, breaks = brks, probability = TRUE)
points(x, dbinom(x, size = n, prob = p), col="red")
Plot the pmf and cdf function for the binomial distribution with probability of success 0.25 and 39 trials, i.e. \(X\sim Bin(39,0.25)\). Then sample 999 random binomials with 39 trials and probability of success 0.25 and plot them on a histogram with the true probability mass function.
n <- 39
p <- 0.25
x <- 0:n
plot(x, dbinom(x, size = n, prob = p),
main = 'Probability mass function for Bin(39,0.25)')
plot(x, pbinom(x, size = n, prob = p), type = 's',
main = 'Cumulative distribution function for Bin(39, 0.25)')
draws <- rbinom(999, size = n, prob = p)
brks <- (0:(n+1)) - 0.5
hist(draws, breaks = brks, probability = TRUE, main = "Random draws from Bin(39,0.25)")
points(x, dbinom(x, size = n, prob = p), col="red")
While the binomial distribution has an upper limit (\(n\)), we sometimes run an experiment and are counting successes without any technical upper limit. These experiments are usually run for some amount of time or over some amount of space or both. For example,
When there is no technical upper limit (even if the probability of more is extremely small), then a Poisson distribution can be used. The Poisson distribution has a single parameter, the rate that describes, on average, how many of the things are expected to be observed. We write \(X\sim Po(\lambda)\) where \(\lambda\) is the rate parameter.
Suppose we record the number of network failures in a day and on average we see 2 failures per day. The number of network failures in a day has no upper limit, so we’ll use the Poisson distribution.
Here is the pmf:
rate <- 2
x <- 0:10 # with no upper limit we need to decide on an upper limit
plot(x, dpois(x, lambda = rate), main = "Probability mass function for Po(2)")
Here is the cdf:
plot(x, ppois(x, lambda = rate), type="s", main = "Cumulative distribution function for Po(2)")
And random draws
draws <- rpois(100, lambda = rate)
hist(draws, breaks = (0:(max(draws)+1)) - 0.5, probability = TRUE, main = "Random draws from Po(2)")
points(x, dpois(x, lambda = rate), col="red")
Plot the pmf and cdf function for a Poisson distribution with rate 23, i.e. \(X\sim Po(23)\). Then sample 999 Poisson random variables with rate 23 and plot them on a histogram with the true probability mass function.
rate <- 23
x <- 5:40 # with no upper limit we need to decide on an upper limit
plot(x, dpois(x, lambda = rate),
main = 'Probability mass function for Po(23)')
plot(x, ppois(x, lambda = rate), type="s",
main = 'Cumulative distribution function for Po(23)')
draws <- rpois(999, lambda = rate)
hist(draws, breaks = (min(draws):(max(draws)+1)) - 0.5, probability = TRUE,
main = "Random draws from Po(23)")
points(x, dpois(x, lambda = rate), col="red")
In constrast to discrete random variables, continuous random variables can take on an uncountably infinite number of values. The easiest way for this to happen is that the random variable can take on any value between two specified values (and infinity counts), i.e. an interval.
Continuous random variables have a probability density function (pdf) instead of a pmf. When integrated from a to b, this pdf gives the probability the random variable will take on a value between a and b. Continuous random variables still have a cdf, quantile function, and random generator that all still have the same interpretation.
The simplest continuous random variable is the uniform random variable. If \(Y\) is a random variable and it is uniformly distributed between a and b, then we write \(Y\sim Unif(a,b)\) and this means that \(Y\) can take on any value between a and b with equal probability.
The probability density function for a uniform random variable is zero outside of a and b (indicating that the random variable cannot take values below a or above b) and is constant at a value of 1/(b-a) from a to b.
a <- 0
b <- 1
# The curve function expects you to give a function of `x` and then it
# (internally) creates a sequence of values from `from` and to `to` and creates
# plots similar to what we had before, but using a line rather than points.
curve(dunif(x, min = a, max = b), from = -1, to = 2,
xlab='y', ylab='f(y)', main='Probability density function for Unif(0,1)')
The cumulative distribution function is the integral from negative infinite up to y of the probability density function. Thus it indicates the probability the random variables will take on values less than y, i.e. \(P(Y<y)\). Since the probability density function for the uniform is constant, the integral is simply zero from negative infinite up to a, then a straight line from (a,0) up to (b,1) and then constant after that.
curve(punif(x, min = a, max = b), from = -1, to = 2,
xlab='y', ylab='F(y)', main='Cumulative distribution function for Unif(0,1)')
The quantile function is the inverse of the cumulative distribution function. For a given probability p, it finds the value such that the probability the random variable is below that value is p. That is it finds the value x in the expression \(P(X < x) = p\).
curve(qunif(x, min = a, max = b), from = 0, to = 1,
xlab='p', ylab='F^{-1}(p)', main='Quantile function for Unif(0,1)')
To draw random values from the distribution, use the r
version of the function. For instance, here is 100 random Unif(a,b) draws represented as a histogram.
random_uniforms <- runif(100, min = a, max = b)
hist(random_uniforms, probability = TRUE, main = "Random draws from Unif(0,1)")
curve(dunif(x, min = a, max = b), add = TRUE, col="red")
Plot the pdf, cdf, and quantile function for a uniform distribution on the interval (13, 65), i.e. \(X\sim Unif(13,65)\). Then sample 999 random uniforms on the interval (13,65) and plot a histogram of these draws with the probability density function.
a <- 13
b <- 65
opar = par(mfrow=c(2,2)) # Create a 2x2 grid of figures
curve(dunif(x, min = a, max = b), from = a-1, to = b+1,
xlab='y', ylab='f(y)', main='Probability density function for Unif(13,65)')
curve(punif(x, min = a, max = b), from = a-1, to = b+1,
xlab='y', ylab='f(y)', main='Cumulative distribution function for Unif(13,65)')
curve(qunif(x, min = a, max = b), from = 0, to = 1,
xlab='y', ylab='f(y)', main='Quantile function for Unif(13,65)')
random_uniforms <- runif(999, min = a, max = b)
hist(random_uniforms, probability = TRUE, main = "Random draws from Unif(13,65)")
curve(dunif(x, min = a, max = b), add=TRUE, col="red")
par(opar)
The most important distribution for this course is the normal (Gaussian) distribution. The normal distribution has two parameters: the mean (\(\mu\)) and the variance (\(\sigma^2\)) and we write \(X\sim N(\mu,\sigma^2)\). If \(\mu=0\) and \(\sigma^2=1\), we call the associated random variable a standard normal. The probability density function for the normal distribution is the well-known bell-shaped curve.
mu <- 0
sigma <- 1 # standard deviation
curve(dnorm(x, mean = mu, sd = sigma), # notice the 3rd argument is the sd
from = -4, to = 4,
main = "PDF for a standard normal")
The quantile function has a sigmoid shape.
curve(pnorm(x, mean = mu, sd = sigma),
from = -4, to = 4,
main = "CDF for a standard normal",
ylab = "F(x)")
The quantile function will be used later in the semester to help construct confidence/credible intervals.
curve(qnorm(x, mean = mu, sd = sigma),
from = 0, to = 1,
main = "Quantile function for a standard normal")
Then we can take random draws
draws <- rnorm(100, mean = mu, sd = sigma)
hist(draws, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red")
Plot the pdf, cdf, and quantile function for a normal distribution with mean -4 and variance 3, i.e. \(X\sim N(-4,3)\). Then sample 999 random N(-4,3) and plot a histogram of these draws with the probability density function.
mu <- -4
sigma <- sqrt(3) # standard deviation!!!
opar = par(mfrow=c(2,2))
curve(dnorm(x, mean = mu, sd = sigma),
from = mu-4*sigma, to = mu+4*sigma,
main = "PDF for a normal")
curve(pnorm(x, mean = mu, sd = sigma),
from = mu-4*sigma, to = mu+4*sigma,
main = "CDF for a normal")
curve(qnorm(x, mean = mu, sd = sigma),
from = 0, to = 1,
main = "Quantile function for a normal")
draws <- rnorm(999, mean = mu, sd = sigma)
hist(draws, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red")
par(opar)
For any normal random variable, determine the probability it is
Since this is apparently for any normal, we will just use the standard normal. So if \(X\sim N(\mu, \sigma^2)\), we must determine \(P(|X|<c) = P(-c < X < c) = P(X<c) - P(X< -c)\) for \(c\) in 1,2,3.
# Try any values for mu and sigma
mu <- 0
sigma <- 1
pnorm(mu+1*sigma, mean = mu, sd = sigma) - pnorm(mu-1*sigma, mean = mu, sd = sigma)
## [1] 0.6826895
pnorm(mu+2*sigma, mean = mu, sd = sigma) - pnorm(mu-2*sigma, mean = mu, sd = sigma)
## [1] 0.9544997
pnorm(mu+3*sigma, mean = mu, sd = sigma) - pnorm(mu-3*sigma, mean = mu, sd = sigma)
## [1] 0.9973002
Suppose a manufacturing line produces temperature sensors and has a sensor failure proportion equal to 1.5%. In a given day, the plant tests 70 sensors. If we assume the sensors are independent (given the failure proportion), then \(X\sim Bin(70,0.015)\).
Answer the following questions:
The second question is tricky, because we need \(P(X\ge 3) = 1-P(X<3) = 1-P(X\le 2)\).
n <- 70
p <- 0.015
dbinom(0, size = n, prob = p)
## [1] 0.3471652
1-pbinom(2, size = n, prob = p)
## [1] 0.08833028