library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("Sleuth3")
set.seed(20230319)
Recall that if \(X_i \stackrel{ind}{\sim} Ber(\theta)\), then \[ Y = X_1 + X_2 + \cdots + X_n \sim Bin(n,\theta). \] Thus \(Y\) is the number of successes out of \(n\) attempts when the attempts are independent and have a common probability of success \(\theta\) (which may be unknown).
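This relationship is easy to verify by simulation; the following sketch (not part of the original notes) sums independent Bernoulli draws and compares the sample mean and variance to the Binomial values \(n\theta\) and \(n\theta(1-\theta)\):

```r
# Simulate many Bin(n, theta) draws as row sums of independent Bernoulli draws
set.seed(1)
n     <- 10
theta <- 0.3
x <- matrix(rbinom(1e4 * n, size = 1, prob = theta), ncol = n) # Bernoulli draws
y <- rowSums(x)                                                # each row sum is one Bin(n, theta) draw
mean(y) # should be near n * theta           = 3
var(y)  # should be near n * theta * (1 - theta) = 2.1
```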
If there is a single binomial population, then we are mainly interested in estimating \(\theta\) (and testing hypotheses about it).
Modified from the example in R's `binom.test` documentation:
## Conover (1971), p. 97f.
## Under (the assumption of) simple Mendelian inheritance, a cross
## between plants of two particular genotypes produces progeny 1/4 of
## which are "dwarf" and 3/4 of which are "giant", respectively.
## In an experiment to determine if this assumption is reasonable, a
## cross results in progeny having 243 dwarf and 682 giant plants.
## If "giant" is taken as success, the null hypothesis is that p =
## 3/4 and the alternative that p != 3/4.
successes <- 682
failures <- 243
binom.test(c(successes, failures), p = 3/4)
##
## Exact binomial test
##
## data: c(successes, failures)
## number of successes = 682, number of trials = 925, p-value = 0.3825
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
## 0.7076683 0.7654066
## sample estimates:
## probability of success
## 0.7372973
Alternatively, you can pass the number of successes and the number of trials directly:
y <- successes
n <- successes + failures
binom.test(y, n, p = 3/4)
##
## Exact binomial test
##
## data: y and n
## number of successes = 682, number of trials = 925, p-value = 0.3825
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
## 0.7076683 0.7654066
## sample estimates:
## probability of success
## 0.7372973
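Under the hood, the exact two-sided p-value sums the \(Bin(n, 0.75)\) probability of every outcome that is no more likely than the observed count. A by-hand sketch (using the small relative tolerance that R's documented algorithm applies):

```r
y  <- 682
n  <- 925
p0 <- 3/4
d  <- dbinom(0:n, n, p0)                        # null probability of every possible count
pv <- sum(d[d <= dbinom(y, n, p0) * (1 + 1e-7)]) # sum outcomes as or less likely than observed
pv                                               # 0.3825, matching binom.test()
```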
An asymptotic approach, `prop.test()`, can be used when the sample size is large and the true probability is not too close to 0 or 1.
prop.test(matrix(c(successes, failures), ncol = 2), p = 3/4)
##
## 1-sample proportions test with continuity correction
##
## data: matrix(c(successes, failures), ncol = 2), null probability 3/4
## X-squared = 0.72973, df = 1, p-value = 0.393
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
## 0.7074391 0.7651554
## sample estimates:
## p
## 0.7372973
or, more commonly,
prop.test(y, n, p = 3/4)
##
## 1-sample proportions test with continuity correction
##
## data: y out of n, null probability 3/4
## X-squared = 0.72973, df = 1, p-value = 0.393
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
## 0.7074391 0.7651554
## sample estimates:
## p
## 0.7372973
args(prop.test)
## function (x, n, p = NULL, alternative = c("two.sided", "less",
## "greater"), conf.level = 0.95, correct = TRUE)
## NULL
prop.test(y, n, p = 3/4, alternative = "greater", conf.level = 0.9, correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: y out of n, null probability 3/4
## X-squared = 0.79604, df = 1, p-value = 0.8139
## alternative hypothesis: true p is greater than 0.75
## 90 percent confidence interval:
## 0.7183437 1.0000000
## sample estimates:
## p
## 0.7372973
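Without the continuity correction, the one-sample statistic is just the squared z-statistic from the normal approximation. Computing it by hand (a sketch) reproduces the output above:

```r
y  <- 682
n  <- 925
p0 <- 3/4
z  <- (y/n - p0) / sqrt(p0 * (1 - p0) / n)  # standardized sample proportion under H0
z^2                                          # 0.79604 = X-squared above
1 - pnorm(z)                                 # 0.8139, the "greater" p-value
```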
Note: I would ALWAYS use the continuity correction (the default, `correct = TRUE`).
Similarly for `binom.test()`, you can change the alternative and the confidence level.
args(binom.test)
## function (x, n, p = 0.5, alternative = c("two.sided", "less",
## "greater"), conf.level = 0.95)
## NULL
binom.test(y, n, p = 3/4, alternative = "greater", conf.level = 0.9)
##
## Exact binomial test
##
## data: y and n
## number of successes = 682, number of trials = 925, p-value = 0.8241
## alternative hypothesis: true probability of success is greater than 0.75
## 90 percent confidence interval:
## 0.7178455 1.0000000
## sample estimates:
## probability of success
## 0.7372973
In a two population binomial situation, we have two separate binomials.
\[ Y_1 \sim Bin(n_1, \theta_1) \qquad \mbox{and} \qquad Y_2 \sim Bin(n_2, \theta_2) \] with \(Y_1\) and \(Y_2\) being independent.
Independent here means that, if you know \(\theta_1\) and \(\theta_2\), knowing \(Y_1\) gives you no (additional) information about \(Y_2\) (and vice versa). The most common situation in which independence is violated is when the data are paired or clustered, and we will talk about this in the contingency table section below.
When there are two populations, we are often interested in whether an observed difference in the probability of success between the two populations reflects a causal relationship.
Remember that with observational data we cannot determine cause-and-effect relationships. But there is an even more common situation, called Simpson’s Paradox, that can hamper our understanding of observational data, as we will see with the UC-Berkeley admissions data.
Let’s take a look at the admission rates by gender at UC-Berkeley by calculating the total number of applicants and total number of admits by gender across all departments.
admissions <- UCBAdmissions %>%
as.data.frame() %>%
group_by(Admit, Gender) %>%
summarize(
n = sum(Freq),
.groups = "drop"
) %>%
pivot_wider(names_from = "Admit", values_from = "n") %>%
mutate(
y = Admitted,
n = Admitted + Rejected,
p = y/n)
admissions
## # A tibble: 2 × 6
## Gender Admitted Rejected y n p
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male 1198 1493 1198 2691 0.445
## 2 Female 557 1278 557 1835 0.304
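As a quick sanity check, the same aggregate counts can be read directly off the built-in `UCBAdmissions` object, which is a 3-way table of Admit × Gender × Dept:

```r
# Sum over the third dimension (Dept) to get the aggregate Admit x Gender table
apply(UCBAdmissions, c(1, 2), sum)
```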
Because the number of individuals is large and the probability is not close to 0 or 1, we’ll go ahead and use `prop.test()`. With two populations, the default null hypothesis is \[ H_0: \theta_1 = \theta_2. \]
prop.test(admissions$y, admissions$n)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: admissions$y out of admissions$n
## X-squared = 91.61, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.1129887 0.1703022
## sample estimates:
## prop 1 prop 2
## 0.4451877 0.3035422
This p-value suggests that the null model, which assumes independence (both across and within binomial samples) and equal probabilities, is unlikely to be true. As the analyst, we need to decide whether the independence assumptions are reasonable.
We’ll come back to this example in a bit.
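The confidence interval above can be reproduced by hand: a Wald interval for \(\theta_1 - \theta_2\), widened by the continuity correction \((1/n_1 + 1/n_2)/2\). A sketch using the counts from the table above:

```r
y <- c(1198, 557)    # admitted (Male, Female)
n <- c(2691, 1835)   # applicants
p <- y / n
se   <- sqrt(sum(p * (1 - p) / n))   # unpooled standard error of the difference
corr <- sum(1 / n) / 2               # continuity correction
ci <- (p[1] - p[2]) + c(-1, 1) * (qnorm(.975) * se + corr)
ci                                   # 0.1130 0.1703, matching prop.test()
```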
With experimental data, we can determine causal relationships. In these data, participants were randomly assigned to receive either 1,000 mg of Vitamin C per day through the cold season or a placebo.
case1802 %>%
mutate(p = NoCold / (Cold+NoCold))
## Treatment Cold NoCold p
## 1 Placebo 335 76 0.1849148
## 2 VitC 302 105 0.2579853
prop.test(case1802[,c("Cold","NoCold")]) # I feel like this should work
## Error in prop.test(case1802[, c("Cold", "NoCold")]): argument "n" is missing, with no default
prop.test(cbind(case1802$Cold, case1802$NoCold))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: cbind(case1802$Cold, case1802$NoCold)
## X-squared = 5.9196, df = 1, p-value = 0.01497
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.01391972 0.13222111
## sample estimates:
## prop 1 prop 2
## 0.8150852 0.7420147
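Since each subject contributes one binary outcome, an exact alternative here is Fisher’s exact test (a sketch, not part of the original analysis), which gives a p-value close to the `prop.test()` result:

```r
# 2x2 table of Treatment by outcome, from the counts above
tab <- rbind(Placebo = c(Cold = 335, NoCold = 76),
             VitC    = c(Cold = 302, NoCold = 105))
pv <- fisher.test(tab)$p.value
pv   # close to the prop.test() p-value of about 0.015
```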
Remember our independence assumptions. As the analyst, we don’t have enough information about these data to assess the independence assumption.
When we have more than two populations, we can still use `prop.test()` to test whether all of the population probabilities are equal.
Let’s return to our UC-Berkeley admissions data. Rather than combine across departments, let’s compare the probability of acceptance across departments. First, let’s ignore gender.
admissions_by_department <- UCBAdmissions %>%
as.data.frame() %>%
group_by(Dept, Admit) %>%
summarize(
n = sum(Freq),
.groups = "drop"
) %>%
pivot_wider(names_from = "Admit", values_from = "n") %>%
mutate(Total = Admitted + Rejected)
admissions_by_department %>%
mutate(p = Admitted / Total)
## # A tibble: 6 × 5
## Dept Admitted Rejected Total p
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 A 601 332 933 0.644
## 2 B 370 215 585 0.632
## 3 C 322 596 918 0.351
## 4 D 269 523 792 0.340
## 5 E 147 437 584 0.252
## 6 F 46 668 714 0.0644
ggplot(admissions_by_department,
aes(x = Dept, y = Admitted / Total, size = Total)) +
geom_point() +
ylim(0,1)
Testing for equality of admission probabilities across departments:
prop.test(admissions_by_department$Admitted,
admissions_by_department$Total)
##
## 6-sample test for equality of proportions without continuity correction
##
## data: admissions_by_department$Admitted out of admissions_by_department$Total
## X-squared = 778.91, df = 5, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5 prop 6
## 0.64415863 0.63247863 0.35076253 0.33964646 0.25171233 0.06442577
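The test statistic pools all six departments to estimate a common probability under \(H_0\) and then sums the squared standardized deviations. A by-hand sketch using the counts from the table above:

```r
y <- c(601, 370, 322, 269, 147, 46)    # Admitted, departments A-F
n <- c(933, 585, 918, 792, 584, 714)   # Total applicants
p_pool <- sum(y) / sum(n)              # pooled estimate under H0: all theta_j equal
stat <- sum((y - n * p_pool)^2 / (n * p_pool * (1 - p_pool)))
stat                                   # 778.91, matching X-squared above
```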
This isn’t really the question we are interested in. We are interested in the probability of admission by gender controlling (or adjusting) for department.
admissions_by_dept_and_gender <- UCBAdmissions %>%
as.data.frame() %>%
pivot_wider(names_from = "Admit", values_from = "Freq") %>%
# group_by(Dept, Gender) %>%
mutate(
Total = Admitted + Rejected,
p_hat = Admitted / Total,
lcl = p_hat - qnorm(.975)*sqrt(p_hat*(1-p_hat)/Total),
ucl = p_hat + qnorm(.975)*sqrt(p_hat*(1-p_hat)/Total)
)
admissions_by_dept_and_gender
## # A tibble: 12 × 8
## Gender Dept Admitted Rejected Total p_hat lcl ucl
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male A 512 313 825 0.621 0.587 0.654
## 2 Female A 89 19 108 0.824 0.752 0.896
## 3 Male B 353 207 560 0.630 0.590 0.670
## 4 Female B 17 8 25 0.68 0.497 0.863
## 5 Male C 120 205 325 0.369 0.317 0.422
## 6 Female C 202 391 593 0.341 0.302 0.379
## 7 Male D 138 279 417 0.331 0.286 0.376
## 8 Female D 131 244 375 0.349 0.301 0.398
## 9 Male E 53 138 191 0.277 0.214 0.341
## 10 Female E 94 299 393 0.239 0.197 0.281
## 11 Male F 22 351 373 0.0590 0.0351 0.0829
## 12 Female F 24 317 341 0.0704 0.0432 0.0975
ggplot(admissions_by_dept_and_gender,
aes(x = Dept, y = Admitted / Total, ymin = lcl, ymax = ucl,
color = Gender)) +
geom_point(aes(size = Total), position = position_dodge(width = 0.1)) +
geom_linerange(position = position_dodge(width = 0.1))
m <- glm(cbind(Admitted, Rejected) ~ Dept + Gender,
data = admissions_by_dept_and_gender,
family = "binomial")
summary(m)
##
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Dept + Gender, family = "binomial",
## data = admissions_by_dept_and_gender)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## -1.2487 3.7189 -0.0560 0.2706 1.2533 -0.9243 0.0826 -0.0858
## 9 10 11 12
## 1.2205 -0.8509 -0.2076 0.2052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.58205 0.06899 8.436 <2e-16 ***
## DeptB -0.04340 0.10984 -0.395 0.693
## DeptC -1.26260 0.10663 -11.841 <2e-16 ***
## DeptD -1.29461 0.10582 -12.234 <2e-16 ***
## DeptE -1.73931 0.12611 -13.792 <2e-16 ***
## DeptF -3.30648 0.16998 -19.452 <2e-16 ***
## GenderFemale 0.09987 0.08085 1.235 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 877.056 on 11 degrees of freedom
## Residual deviance: 20.204 on 5 degrees of freedom
## AIC: 103.14
##
## Number of Fisher Scoring iterations: 4
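A logistic regression coefficient lives on the log-odds scale, so exponentiating it gives an odds ratio. A sketch using the `GenderFemale` estimate and standard error printed in the summary above:

```r
est <- 0.09987   # GenderFemale coefficient (log odds ratio) from the summary
se  <- 0.08085   # its standard error
or_hat <- exp(est)                               # ~1.105: odds of admission ~10% higher
or_hat                                           # for women, adjusting for department
ci <- exp(est + c(-1, 1) * qnorm(.975) * se)     # Wald interval for the odds ratio
ci                                               # ~(0.94, 1.29); includes 1, consistent
                                                 # with the non-significant z-test
```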