R code

library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("Sleuth3")
set.seed(20230319)

Binomial analyses

Recall that if \(X_i \stackrel{ind}{\sim} Ber(\theta)\), then \[ Y = X_1 + X_2 + \cdots + X_n \sim Bin(n,\theta). \] Thus \(Y\) is the number of successes out of \(n\) attempts when the attempts are independent and they have a common probability of success \(\theta\) (which may be unknown).
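As a quick sanity check (a sketch, not part of the original analysis), we can simulate the Bernoulli trials directly and confirm that their sum is a single binomial draw:

# Sum of n independent Bernoulli(theta) draws is one Binomial(n, theta) draw
n_trials <- 10
theta    <- 0.3
x        <- rbinom(n_trials, size = 1, prob = theta)  # Bernoulli trials
sum(x)                                                # plays the role of Y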

One group

If there is a single binomial population, then we are primarily interested in estimating \(\theta\).
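The natural point estimate is the sample proportion \(\hat{\theta} = y/n\). A minimal sketch with made-up counts (the functions below compute this, along with intervals and p-values, for us):

y0 <- 15   # hypothetical number of successes
n0 <- 20   # hypothetical number of attempts
y0 / n0    # point estimate of theta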

binom.test()

Modified from the example in the binom.test() help page:

## Conover (1971), p. 97f.
## Under (the assumption of) simple Mendelian inheritance, a cross
##  between plants of two particular genotypes produces progeny 1/4 of
##  which are "dwarf" and 3/4 of which are "giant", respectively.
##  In an experiment to determine if this assumption is reasonable, a
##  cross results in progeny having 243 dwarf and 682 giant plants.
##  If "giant" is taken as success, the null hypothesis is that p =
##  3/4 and the alternative that p != 3/4.
successes <- 682
failures  <- 243

binom.test(c(successes, failures), p = 3/4)
## 
##  Exact binomial test
## 
## data:  c(successes, failures)
## number of successes = 682, number of trials = 925, p-value = 0.3825
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
##  0.7076683 0.7654066
## sample estimates:
## probability of success 
##              0.7372973

Alternatively, you can use

y <- successes
n <- successes + failures
binom.test(y, n, p = 3/4)
## 
##  Exact binomial test
## 
## data:  y and n
## number of successes = 682, number of trials = 925, p-value = 0.3825
## alternative hypothesis: true probability of success is not equal to 0.75
## 95 percent confidence interval:
##  0.7076683 0.7654066
## sample estimates:
## probability of success 
##              0.7372973
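If you need pieces of the result downstream, note that binom.test() returns an htest object whose components can be extracted (a sketch):

bt <- binom.test(y, n, p = 3/4)
bt$estimate   # sample proportion
bt$conf.int   # 95% confidence interval
bt$p.value    # p-value for H0: theta = 3/4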

prop.test()

prop.test() uses an asymptotic (normal) approximation, which is appropriate when the sample size is large and the true probability is not too close to 0 or 1.

prop.test(matrix(c(successes, failures), ncol = 2), p = 3/4)
## 
##  1-sample proportions test with continuity correction
## 
## data:  matrix(c(successes, failures), ncol = 2), null probability 3/4
## X-squared = 0.72973, df = 1, p-value = 0.393
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
##  0.7074391 0.7651554
## sample estimates:
##         p 
## 0.7372973

or, more commonly,

prop.test(y, n, p = 3/4)
## 
##  1-sample proportions test with continuity correction
## 
## data:  y out of n, null probability 3/4
## X-squared = 0.72973, df = 1, p-value = 0.393
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
##  0.7074391 0.7651554
## sample estimates:
##         p 
## 0.7372973
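To see where the asymptotics enter, here is a hand-computed Wald interval \(\hat{\theta} \pm z_{0.975}\sqrt{\hat{\theta}(1-\hat{\theta})/n}\); it will be close to, but not identical to, the (Wilson, continuity-corrected) interval that prop.test() reports:

p_hat <- y / n
p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)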

Additional arguments

args(prop.test)
## function (x, n, p = NULL, alternative = c("two.sided", "less", 
##     "greater"), conf.level = 0.95, correct = TRUE) 
## NULL
prop.test(y, n, p = 3/4, alternative = "greater", conf.level = 0.9, correct = FALSE)
## 
##  1-sample proportions test without continuity correction
## 
## data:  y out of n, null probability 3/4
## X-squared = 0.79604, df = 1, p-value = 0.8139
## alternative hypothesis: true p is greater than 0.75
## 90 percent confidence interval:
##  0.7183437 1.0000000
## sample estimates:
##         p 
## 0.7372973

Note: I would ALWAYS use .

Similarly for binom.test():

args(binom.test)
## function (x, n, p = 0.5, alternative = c("two.sided", "less", 
##     "greater"), conf.level = 0.95) 
## NULL
binom.test(y, n, p = 3/4, alternative = "greater", conf.level = 0.9)
## 
##  Exact binomial test
## 
## data:  y and n
## number of successes = 682, number of trials = 925, p-value = 0.8241
## alternative hypothesis: true probability of success is greater than 0.75
## 90 percent confidence interval:
##  0.7178455 1.0000000
## sample estimates:
## probability of success 
##              0.7372973

Two groups

In a two-population binomial situation, we have two separate binomial random variables:

\[ Y_1 \sim Bin(n_1, \theta_1) \qquad \mbox{and} \qquad Y_2 \sim Bin(n_2, \theta_2) \] with \(Y_1\) and \(Y_2\) being independent.

Independence here means that, if you know \(\theta_1\) and \(\theta_2\), then knowing \(Y_1\) gives you no (additional) information about \(Y_2\) (and vice versa). The most common situation in which independence is violated is when the data are paired or matched, and we will talk about this in the contingency table section below.

When there are two populations, we are often interested in whether the difference between the two populations causes the observed difference in the probability of success.

Observational data

Remember that with observational data we cannot determine cause-and-effect relationships. But there is another common issue, called Simpson’s Paradox, that can hamper our understanding of observational data, as we will see with the UC-Berkeley admissions data.

Let’s take a look at the admission rates by gender at UC-Berkeley by calculating the total number of applicants and total number of admits by gender across all departments.
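Before aggregating, it can help to look at the raw data: UCBAdmissions is a 3-dimensional table (Admit × Gender × Dept) that ships with R, so we flatten it for printing (a quick sketch):

str(UCBAdmissions)     # 2 x 2 x 6 table: Admit, Gender, Dept
ftable(UCBAdmissions)  # flatten the 3-way table for printing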

admissions <- UCBAdmissions %>%
  as.data.frame() %>%
  group_by(Admit, Gender) %>%
  summarize(
    n = sum(Freq),
    .groups = "drop"
  ) %>%
  pivot_wider(names_from = "Admit", values_from = "n") %>%
  mutate(
    y = Admitted, 
    n = Admitted + Rejected,
    p = y/n)

admissions
## # A tibble: 2 × 6
##   Gender Admitted Rejected     y     n     p
##   <fct>     <dbl>    <dbl> <dbl> <dbl> <dbl>
## 1 Male       1198     1493  1198  2691 0.445
## 2 Female      557     1278   557  1835 0.304

Because the number of individuals is large and the probability is not close to 0 or 1, we’ll go ahead and use prop.test(). With two populations, the default null hypothesis is that \[ H_0: \theta_1 = \theta_2. \]

prop.test(admissions$y, admissions$n)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  admissions$y out of admissions$n
## X-squared = 91.61, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1129887 0.1703022
## sample estimates:
##    prop 1    prop 2 
## 0.4451877 0.3035422

This p-value suggests that the null model, which assumes independence (both across and within the binomial samples) and equal probabilities, is unlikely to be true. As the analyst, we need to decide whether the independence assumptions are reasonable.

We’ll come back to this example in a bit.

Experimental data

With experimental data, we can determine causal relationships. In this study, participants were randomly assigned to receive either 1,000 mg of vitamin C per day through the cold season or a placebo.

case1802 %>%
  mutate(p = NoCold / (Cold+NoCold))
##   Treatment Cold NoCold         p
## 1   Placebo  335     76 0.1849148
## 2      VitC  302    105 0.2579853
prop.test(case1802[,c("Cold","NoCold")]) # I feel like this should work
## Error in prop.test(case1802[, c("Cold", "NoCold")]): argument "n" is missing, with no default
prop.test(cbind(case1802$Cold, case1802$NoCold))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  cbind(case1802$Cold, case1802$NoCold)
## X-squared = 5.9196, df = 1, p-value = 0.01497
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.01391972 0.13222111
## sample estimates:
##    prop 1    prop 2 
## 0.8150852 0.7420147
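The data.frame version above fails because prop.test() only recognizes a matrix as a two-column table of successes and failures; a data frame is treated as x with no n supplied. Coercing the two columns to a matrix is one way (a sketch) to make the shorter call work:

prop.test(as.matrix(case1802[, c("Cold", "NoCold")]))  # equivalent to the cbind() call above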

Remember our independence assumptions. As the analyst, we don’t have enough information about these data to assess the independence assumption.

More than two groups

When we have more than two populations, we can test whether all of the success probabilities are equal (e.g., with prop.test()) or model the probabilities with logistic regression, as the sections below illustrate.

Let’s return to our UC-Berkeley admissions data. Rather than combine across departments, let’s compare the probability of acceptance across departments. First, let’s ignore gender.

admissions_by_department <- UCBAdmissions %>%
  as.data.frame() %>%
  group_by(Dept, Admit) %>%
  summarize(
    n = sum(Freq),
    .groups = "drop"  
  ) %>%
  pivot_wider(names_from = "Admit", values_from = "n") %>%
  mutate(Total = Admitted + Rejected)

Contingency table

admissions_by_department %>%
  mutate(p = Admitted / Total)
## # A tibble: 6 × 5
##   Dept  Admitted Rejected Total      p
##   <fct>    <dbl>    <dbl> <dbl>  <dbl>
## 1 A          601      332   933 0.644 
## 2 B          370      215   585 0.632 
## 3 C          322      596   918 0.351 
## 4 D          269      523   792 0.340 
## 5 E          147      437   584 0.252 
## 6 F           46      668   714 0.0644
ggplot(admissions_by_department,
       aes(x = Dept, y = Admitted / Total, size = Total)) + 
  geom_point() + 
  ylim(0,1)

Testing equality

Testing for equality of the admission probabilities across departments:

prop.test(admissions_by_department$Admitted,
          admissions_by_department$Total)
## 
##  6-sample test for equality of proportions without continuity correction
## 
## data:  admissions_by_department$Admitted out of admissions_by_department$Total
## X-squared = 778.91, df = 5, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
##     prop 1     prop 2     prop 3     prop 4     prop 5     prop 6 
## 0.64415863 0.63247863 0.35076253 0.33964646 0.25171233 0.06442577
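If the overall test rejects and you want to know which departments differ, one follow-up (a sketch) is all pairwise comparisons with a multiplicity adjustment via pairwise.prop.test():

pairwise.prop.test(admissions_by_department$Admitted,
                   admissions_by_department$Total,
                   p.adjust.method = "holm")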

Regression

This isn’t really the question we are interested in. We are interested in the probability of admission by gender controlling (or adjusting) for department.

admissions_by_dept_and_gender <- UCBAdmissions %>%
  as.data.frame() %>%
  pivot_wider(names_from = "Admit", values_from = "Freq") %>%
  # group_by(Dept, Gender) %>%
  mutate(
    Total = Admitted + Rejected,
    p_hat = Admitted / Total,
    lcl   = p_hat - qnorm(.975)*sqrt(p_hat*(1-p_hat)/Total),
    ucl   = p_hat + qnorm(.975)*sqrt(p_hat*(1-p_hat)/Total)
  )
admissions_by_dept_and_gender 
## # A tibble: 12 × 8
##    Gender Dept  Admitted Rejected Total  p_hat    lcl    ucl
##    <fct>  <fct>    <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
##  1 Male   A          512      313   825 0.621  0.587  0.654 
##  2 Female A           89       19   108 0.824  0.752  0.896 
##  3 Male   B          353      207   560 0.630  0.590  0.670 
##  4 Female B           17        8    25 0.68   0.497  0.863 
##  5 Male   C          120      205   325 0.369  0.317  0.422 
##  6 Female C          202      391   593 0.341  0.302  0.379 
##  7 Male   D          138      279   417 0.331  0.286  0.376 
##  8 Female D          131      244   375 0.349  0.301  0.398 
##  9 Male   E           53      138   191 0.277  0.214  0.341 
## 10 Female E           94      299   393 0.239  0.197  0.281 
## 11 Male   F           22      351   373 0.0590 0.0351 0.0829
## 12 Female F           24      317   341 0.0704 0.0432 0.0975
ggplot(admissions_by_dept_and_gender, 
       aes(x = Dept, y = Admitted / Total, ymin = lcl, ymax = ucl,
           color = Gender)) +
  geom_point(aes(size = Total), position = position_dodge(width = 0.1)) +
  geom_linerange(position = position_dodge(width = 0.1))

m <- glm(cbind(Admitted, Rejected) ~ Dept + Gender, 
        data = admissions_by_dept_and_gender,
        family = "binomial")

summary(m)
## 
## Call:
## glm(formula = cbind(Admitted, Rejected) ~ Dept + Gender, family = "binomial", 
##     data = admissions_by_dept_and_gender)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
## -1.2487   3.7189  -0.0560   0.2706   1.2533  -0.9243   0.0826  -0.0858  
##       9       10       11       12  
##  1.2205  -0.8509  -0.2076   0.2052  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.58205    0.06899   8.436   <2e-16 ***
## DeptB        -0.04340    0.10984  -0.395    0.693    
## DeptC        -1.26260    0.10663 -11.841   <2e-16 ***
## DeptD        -1.29461    0.10582 -12.234   <2e-16 ***
## DeptE        -1.73931    0.12611 -13.792   <2e-16 ***
## DeptF        -3.30648    0.16998 -19.452   <2e-16 ***
## GenderFemale  0.09987    0.08085   1.235    0.217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 877.056  on 11  degrees of freedom
## Residual deviance:  20.204  on  5  degrees of freedom
## AIC: 103.14
## 
## Number of Fisher Scoring iterations: 4
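Since this is logistic regression, the GenderFemale coefficient is on the log-odds scale. One way (a sketch) to summarize the adjusted gender effect is to exponentiate it, here with a Wald interval from confint.default():

exp(coef(m)["GenderFemale"])                # adjusted odds ratio, Female vs Male
exp(confint.default(m)["GenderFemale", ])   # Wald 95% interval on the odds-ratio scale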