R code

library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("Sleuth3")
set.seed(20230319)

Normal analyses

Normal analyses are (typically) relevant for independent, continuous observations. These observations may need to be transformed to satisfy the normality assumption. Pedagogically, we often begin with a single group with a known variance. Since a known variance is uncommon in practice, we skip directly to the unknown-variance situation.
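For example, positively skewed observations are often analyzed on the log scale. A minimal sketch with simulated (hypothetical) data, comparing the raw and log scales:

# Hypothetical, positively skewed data: compare raw and log scales
d <- tibble(y = rlnorm(30, meanlog = 3, sdlog = 0.5)) %>%
  mutate(log_y = log(y))
ggplot(d %>% pivot_longer(everything()), aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~name, scales = "free_x")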

One group

The single group normal analysis begins with the assumptions \[ Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2) \] for \(i=1,\ldots,n\).

An alternative way to write this model is \[ Y_i = \mu + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0,\sigma^2) \]

which has the following assumptions

  • errors are independent,
  • errors are normally distributed,
  • errors have a constant variance,

and

  • common mean.
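
To make these assumptions concrete, we can simulate data from this model; a minimal sketch with hypothetical values for \(\mu\) and \(\sigma\):

# Simulate n = 30 observations from the one-group model
mu    <- 20  # hypothetical mean
sigma <- 4   # hypothetical standard deviation
y     <- mu + rnorm(30, mean = 0, sd = sigma)  # Y_i = mu + epsilon_i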

Summary statistics

creativity <- case0101 %>%
  filter(Treatment == "Intrinsic")
summary(creativity)
##      Score           Treatment 
##  Min.   :12.00   Extrinsic: 0  
##  1st Qu.:17.43   Intrinsic:24  
##  Median :20.40                 
##  Mean   :19.88                 
##  3rd Qu.:22.30                 
##  Max.   :29.70
creativity %>%
  summarize(
    n = n(),
    mean = mean(Score),
    sd = sd(Score)
  )
##    n     mean       sd
## 1 24 19.88333 4.439513
ggplot(creativity, aes(x = Score)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(creativity, aes(x = Score)) + 
  geom_boxplot(outlier.shape = NA)

Test and confidence interval

In a t-test, we are typically interested in the hypothesis \[ H_0: \mu = \mu_0 \] for some value \(\mu_0\).

t.test(creativity$Score)
## 
##  One Sample t-test
## 
## data:  creativity$Score
## t = 21.941, df = 23, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.00869 21.75798
## sample estimates:
## mean of x 
##  19.88333

Alternatively, using a formula

t.test(Score ~ 1, data = creativity)
## 
##  One Sample t-test
## 
## data:  Score
## t = 21.941, df = 23, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  18.00869 21.75798
## sample estimates:
## mean of x 
##  19.88333

Additional arguments

t.test(creativity$Score, mu = 20, alternative = "greater", conf.level = 0.9)
## 
##  One Sample t-test
## 
## data:  creativity$Score
## t = -0.12874, df = 23, p-value = 0.5507
## alternative hypothesis: true mean is greater than 20
## 90 percent confidence interval:
##  18.68762      Inf
## sample estimates:
## mean of x 
##  19.88333
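
With a one-sided alternative, R reports a one-sided confidence interval; here, a 90% lower bound for \(\mu\) with upper limit Inf.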

Two groups

When we have two groups, the model is

\[ Y_{ig} \stackrel{ind}{\sim} N(\mu_g, \sigma_g^2) \] for \(i = 1, \ldots, n_g\) and \(g=1,2\). Alternatively

\[ Y_{ig} = \mu_g + \epsilon_{ig}, \qquad \epsilon_{ig} \stackrel{ind}{\sim} N(0,\sigma_g^2) \] which assumes

  • errors are independent,
  • errors are normally distributed,
  • errors have constant variance within a group,

and

  • each group has its own constant mean.
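
A side-by-side plot is a quick way to eyeball the within-group variability; for example:

ggplot(case0101, aes(x = Treatment, y = Score)) +
  geom_boxplot()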

Summary statistics

case0101 %>%
  group_by(Treatment) %>%
  summarize(
    n = n(),
    mean = mean(Score),
    sd = sd(Score),
    .groups = "drop"
  )
## # A tibble: 2 × 4
##   Treatment     n  mean    sd
##   <fct>     <int> <dbl> <dbl>
## 1 Extrinsic    23  15.7  5.25
## 2 Intrinsic    24  19.9  4.44

Test and confidence intervals

Typically we are interested in the hypothesis \[ H_0: \mu_1 = \mu_2 \qquad \mbox{or, equivalently,} \qquad H_0: \mu_1-\mu_2 = 0 \] and a confidence interval for the difference \(\mu_1-\mu_2\).

Unequal variances

To run the test, we first need to get the appropriate data

intrinsic <- case0101$Score[case0101$Treatment == "Intrinsic"]
extrinsic <- case0101$Score[case0101$Treatment == "Extrinsic"]

then we can run the test

t.test(extrinsic, intrinsic)
## 
##  Welch Two Sample t-test
## 
## data:  extrinsic and intrinsic
## t = -2.9153, df = 43.108, p-value = 0.005618
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.010803 -1.277603
## sample estimates:
## mean of x mean of y 
##  15.73913  19.88333

Alternatively, we can use the formula version and simplify our code

t.test(Score ~ Treatment, data = case0101)
## 
##  Welch Two Sample t-test
## 
## data:  Score by Treatment
## t = -2.9153, df = 43.108, p-value = 0.005618
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
##  -7.010803 -1.277603
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic 
##                15.73913                19.88333

Equal variances

Often, we will make the assumption that the two groups have equal variances.

t.test(Score ~ Treatment, data = case0101, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  Score by Treatment
## t = -2.9259, df = 45, p-value = 0.005366
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
##  -6.996973 -1.291432
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic 
##                15.73913                19.88333

Additional arguments

t.test(Score ~ Treatment, data = case0101, var.equal = TRUE,
       mu = 5, alternative = "greater", conf.level = 0.9)
## 
##  Two Sample t-test
## 
## data:  Score by Treatment
## t = -6.456, df = 45, p-value = 1
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is greater than 5
## 90 percent confidence interval:
##  -5.986439       Inf
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic 
##                15.73913                19.88333
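
Here mu = 5 specifies the hypothesized difference in means, so we are testing \(H_0: \mu_1 - \mu_2 = 5\) (Extrinsic minus Intrinsic) against the one-sided alternative \(\mu_1 - \mu_2 > 5\).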

Paired data

A paired analysis can be used when the data have a natural grouping into pairs of observations. Within each pair, one observation has one level of the explanatory variable while the other observation has the other level.

In the following example, the data are naturally paired because each pair of observations comes from a set of identical twins: one twin has schizophrenia while the other does not. These data compare the volume of the left hippocampus between the twins to understand its relationship to schizophrenia.

Summary statistics

summary(case0202)
##    Unaffected       Affected   
##  Min.   :1.250   Min.   :1.02  
##  1st Qu.:1.600   1st Qu.:1.31  
##  Median :1.770   Median :1.59  
##  Mean   :1.759   Mean   :1.56  
##  3rd Qu.:1.935   3rd Qu.:1.78  
##  Max.   :2.080   Max.   :2.02
schizophrenia <- case0202 %>%
  mutate(pair = LETTERS[1:n()]) %>%
  pivot_longer(-pair, names_to = "diagnosis", values_to = "volume")
ggplot(schizophrenia, aes(x = diagnosis, y = volume, group = pair)) + 
  geom_line()

It is important to connect the points within each twin pair because we are looking for a pattern across the within-pair comparisons.
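
We can also look directly at the within-pair differences; for example:

ggplot(case0202, aes(x = Unaffected - Affected)) +
  geom_histogram(bins = 10)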

Paired t-test

When the data are paired, we can perform a paired t-test.

t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)
## 
##  Paired t-test
## 
## data:  case0202$Unaffected and case0202$Affected
## t = 3.2289, df = 14, p-value = 0.006062
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.0667041 0.3306292
## sample estimates:
## mean difference 
##       0.1986667

This analysis is equivalent to taking the difference in each pair and performing a one-sample t-test.

t.test(case0202$Unaffected - case0202$Affected)
## 
##  One Sample t-test
## 
## data:  case0202$Unaffected - case0202$Affected
## t = 3.2289, df = 14, p-value = 0.006062
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.0667041 0.3306292
## sample estimates:
## mean of x 
## 0.1986667

More than two groups

If we have a single categorical variable with \(G > 2\) groups, the model is \[ Y_{ig} \stackrel{ind}{\sim} N(\mu_g, \sigma^2) \] for \(i = 1, \ldots, n_g\) and \(g = 1, \ldots, G\). Note that, unlike the unequal-variance analysis above, this model assumes a common variance \(\sigma^2\) across groups.

Summary statistics

case0501 %>%
  group_by(Diet) %>%
  summarize(
    n = n(),
    mean = mean(Lifetime),
    sd = sd(Lifetime),
    .groups = "drop"
  )
## # A tibble: 6 × 4
##   Diet      n  mean    sd
##   <fct> <int> <dbl> <dbl>
## 1 N/N85    57  32.7  5.13
## 2 N/R40    60  45.1  6.70
## 3 N/R50    71  42.3  7.77
## 4 NP       49  27.4  6.13
## 5 R/R50    56  42.9  6.68
## 6 lopro    56  39.7  6.99
ggplot(case0501, aes(x = Diet, y = Lifetime)) +
  geom_jitter(width = 0.1)

ANOVA

An ANOVA F-test considers the following null hypothesis \[ H_0: \mu_g = \mu \quad \forall\, g, \] i.e., all of the group means are equal, against the alternative that at least one group mean differs.

m <- lm(Lifetime ~ Diet, data = case0501)
anova(m)
## Analysis of Variance Table
## 
## Response: Lifetime
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Diet        5  12734  2546.8  57.104 < 2.2e-16 ***
## Residuals 343  15297    44.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression

As a general approach, regression allows the response variable to depend on categorical and continuous explanatory variables in complex ways. For the current discussion, we will limit ourselves to categorical explanatory variables.

The general structure of a regression model is \[ Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1X_{i,1} + \cdots + \beta_p X_{i,p}, \sigma^2) \] for \(i=1,\ldots,n\) or, equivalently, \[ Y_i = \beta_0 + \beta_1X_{i,1} + \cdots + \beta_p X_{i,p} + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0, \sigma^2). \] The assumptions of a regression model are

  • errors are independent,
  • errors are normally distributed,
  • errors have constant variance,

and

  • the expected response, \(E[Y_i]\), depends on the explanatory variables according to a linear function (of the parameters).
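
Once a model m has been fit with lm() (as in the sections below), these assumptions can be assessed visually; a minimal sketch of a residuals-versus-fitted plot:

# Plot residuals against fitted values; look for constant spread around zero
ggplot(data.frame(fitted = fitted(m), residual = residuals(m)),
       aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")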

Two groups

We can repeat the equal variance analysis using regression by encoding a dummy (or indicator) variable to represent the group. So let \[ X_{i,1} = \left\{ \begin{array}{ll} 1 & \mbox{if Treatment}_i \mbox{ is Intrinsic} \\ 0 & \mbox{otherwise} \end{array} \right. \]
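
We can inspect the dummy coding R constructs with model.matrix(); for example:

# First few rows of the design matrix R builds for Treatment
model.matrix(~ Treatment, data = case0101) %>% head()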

m <- lm(Score ~ Treatment, data = case0101)

We can summarize the analysis in a few different ways.

Standard regression output

summary(m)
## 
## Call:
## lm(formula = Score ~ Treatment, data = case0101)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.739  -2.983   1.061   2.961   9.817 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          15.739      1.012  15.550  < 2e-16 ***
## TreatmentIntrinsic    4.144      1.416   2.926  0.00537 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.854 on 45 degrees of freedom
## Multiple R-squared:  0.1598, Adjusted R-squared:  0.1412 
## F-statistic: 8.561 on 1 and 45 DF,  p-value: 0.005366
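
Note that the TreatmentIntrinsic estimate (4.144) is the difference in sample means, and its t statistic and p-value match (up to sign) the equal-variance two-sample t-test above.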

Confidence intervals

confint(m)
##                        2.5 %    97.5 %
## (Intercept)        13.700570 17.777691
## TreatmentIntrinsic  1.291432  6.996973
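
Again up to a sign (the t-test compared Extrinsic minus Intrinsic), the TreatmentIntrinsic row matches the 95% confidence interval from the equal-variance t-test.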

ANOVA

anova(m)
## Analysis of Variance Table
## 
## Response: Score
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## Treatment  1  201.71 201.708  8.5608 0.005366 **
## Residuals 45 1060.29  23.562                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

More than two groups

If we have a single categorical variable with more than two groups, we can still utilize a regression model. To fit this model, we

  1. choose a reference level and
  2. create dummy variables for all other levels.

Generally, we let R do this for us: the reference level is the first level of a factor variable, or the first level alphabetically for a character variable.
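
If we prefer a different reference level, we can set it ourselves; a minimal sketch using NP as the (arbitrarily chosen) reference:

# Refit with NP as the reference level; coefficients become differences from NP
m_np <- lm(Lifetime ~ relevel(Diet, ref = "NP"), data = case0501)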

m <- lm(Lifetime ~ Diet, data = case0501)
summary(m)
## 
## Call:
## lm(formula = Lifetime ~ Diet, data = case0501)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
## DietN/R40    12.4254     1.2352  10.059  < 2e-16 ***
## DietN/R50     9.6060     1.1877   8.088 1.06e-14 ***
## DietNP       -5.2892     1.3010  -4.065 5.95e-05 ***
## DietR/R50    10.1945     1.2565   8.113 8.88e-15 ***
## Dietlopro     6.9945     1.2565   5.567 5.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16
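
Note that the F-statistic (57.1 on 5 and 343 degrees of freedom) matches the ANOVA F-test above, since both come from the same fitted model.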

Two categorical variables

When we have two categorical variables, we can utilize regression by

  1. choosing a reference level for each categorical variable and
  2. constructing dummy variables for all other levels.

For simplicity, let’s start with the paired analysis.

m <- lm(volume ~ pair + diagnosis, schizophrenia)
summary(m)
## 
## Call:
## lm(formula = volume ~ pair + diagnosis, data = schizophrenia)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23567 -0.07558  0.00000  0.07558  0.23567 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.506e+00  1.231e-01  12.236 7.28e-09 ***
## pairB               -7.000e-02  1.685e-01  -0.415  0.68412    
## pairC               -9.000e-02  1.685e-01  -0.534  0.60163    
## pairD               -1.200e-01  1.685e-01  -0.712  0.48806    
## pairE                3.900e-01  1.685e-01   2.315  0.03633 *  
## pairF               -1.450e-01  1.685e-01  -0.861  0.40399    
## pairG                1.250e-01  1.685e-01   0.742  0.47044    
## pairH                1.150e-01  1.685e-01   0.682  0.50606    
## pairI               -7.500e-02  1.685e-01  -0.445  0.66305    
## pairJ                2.800e-01  1.685e-01   1.662  0.11879    
## pairK               -4.700e-01  1.685e-01  -2.789  0.01448 *  
## pairL                3.000e-02  1.685e-01   0.178  0.86124    
## pairM                4.250e-01  1.685e-01   2.522  0.02439 *  
## pairN                1.310e-15  1.685e-01   0.000  1.00000    
## pairO                4.200e-01  1.685e-01   2.493  0.02583 *  
## diagnosisUnaffected  1.987e-01  6.153e-02   3.229  0.00606 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1685 on 14 degrees of freedom
## Multiple R-squared:  0.8336, Adjusted R-squared:  0.6554 
## F-statistic: 4.677 on 15 and 14 DF,  p-value: 0.003132

The diagnosisUnaffected line matches the paired analysis we did earlier:

t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)$conf.int
## [1] 0.0667041 0.3306292
## attr(,"conf.level")
## [1] 0.95
confint(m)[16,]
##     2.5 %    97.5 % 
## 0.0667041 0.3306292
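
Next, we consider an example with more levels in each categorical variable: case1301, where Cover is recorded for six treatments (Treat) in each of eight blocks (Block).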

Summary statistics

case1301 %>% 
  group_by(Block, Treat) %>%
  summarize(
    n = n(),
    mean = mean(Cover),
    sd = sd(Cover),
    .groups = "drop"
  )
## # A tibble: 48 × 5
##    Block Treat     n  mean    sd
##    <fct> <fct> <int> <dbl> <dbl>
##  1 B1    C         2  18.5 6.36 
##  2 B1    L         2   4   0    
##  3 B1    Lf        2   4   1.41 
##  4 B1    LfF       2   1.5 0.707
##  5 B1    f         2  17.5 9.19 
##  6 B1    fF        2  11.5 2.12 
##  7 B2    C         2  28.5 9.19 
##  8 B2    L         2   7.5 0.707
##  9 B2    Lf        2   4.5 2.12 
## 10 B2    LfF       2   4   1.41 
## # … with 38 more rows

Graphics become even more helpful with two categorical variables. We may need to experiment with several plots before settling on one that displays the structure well.

ggplot(case1301, aes(x = Block, y = Cover, shape = Treat, color = Treat)) +
  geom_point()

ggplot(case1301, aes(x = Treat, y = Cover, shape = Treat, color = Treat)) +
  geom_point() +
  facet_wrap(~Block, ncol = 4)

Analysis

m <- lm(Cover ~ Block + Treat, data = case1301)
anova(m) # Two-way ANOVA
## Analysis of Variance Table
## 
## Response: Cover
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Block      7  19106  2729.4  20.780 6.977e-16 ***
## Treat      5  23046  4609.1  35.092 < 2.2e-16 ***
## Residuals 83  10902   131.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m)
## 
## Call:
## lm(formula = Cover ~ Block + Treat, data = case1301)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.375  -5.812   0.625   5.438  26.125 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.875      4.217   7.795 1.66e-11 ***
## BlockB2        3.750      4.679   0.801 0.425132    
## BlockB3       31.750      4.679   6.786 1.59e-09 ***
## BlockB4       45.500      4.679   9.725 2.32e-15 ***
## BlockB5       14.750      4.679   3.153 0.002253 ** 
## BlockB6       27.750      4.679   5.931 6.67e-08 ***
## BlockB7       13.500      4.679   2.885 0.004980 ** 
## BlockB8       16.000      4.679   3.420 0.000974 ***
## TreatL       -32.750      4.052  -8.083 4.45e-12 ***
## TreatLf      -35.500      4.052  -8.761 1.96e-13 ***
## TreatLfF     -44.250      4.052 -10.921  < 2e-16 ***
## Treatf        -9.250      4.052  -2.283 0.024995 *  
## TreatfF      -18.500      4.052  -4.566 1.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.46 on 83 degrees of freedom
## Multiple R-squared:  0.7945, Adjusted R-squared:  0.7648 
## F-statistic: 26.74 on 12 and 83 DF,  p-value: < 2.2e-16