library("tidyverse"); theme_set(theme_bw())
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("Sleuth3")
set.seed(20230319)
Normal analyses are (typically) relevant for independent, continuous observations. These observations may need to be transformed to satisfy assumptions of normality. Pedagogically, we often begin with a single group with a known variance. A known variance is uncommon in practice and thus, here, we skip directly to the unknown variance situation.
The single group normal analysis begins with the assumptions \[ Y_i \stackrel{ind}{\sim} N(\mu,\sigma^2) \] for \(i=1,\ldots,n\).
An alternative way to write this model is \[ Y_i = \mu + \epsilon_i, \qquad \epsilon_i \stackrel{ind}{\sim} N(0,\sigma^2) \]
which has the following assumptions
and
creativity <- case0101 %>%
filter(Treatment == "Intrinsic")
summary(creativity)
## Score Treatment
## Min. :12.00 Extrinsic: 0
## 1st Qu.:17.43 Intrinsic:24
## Median :20.40
## Mean :19.88
## 3rd Qu.:22.30
## Max. :29.70
creativity %>%
summarize(
n = n(),
mean = mean(Score),
sd = sd(Score)
)
## n mean sd
## 1 24 19.88333 4.439513
ggplot(creativity, aes(x = Score)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(creativity, aes(x = Score)) +
geom_boxplot(outlier.shape = NA)
In a t-test, we are typically interested in the hypothesis \[ H_0: \mu = \mu_0 \] for some value \(\mu_0\).
t.test(creativity$Score)
##
## One Sample t-test
##
## data: creativity$Score
## t = 21.941, df = 23, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 18.00869 21.75798
## sample estimates:
## mean of x
## 19.88333
Alternatively, using a formula
t.test(Score ~ 1, data = creativity)
##
## One Sample t-test
##
## data: Score
## t = 21.941, df = 23, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 18.00869 21.75798
## sample estimates:
## mean of x
## 19.88333
t.test(creativity$Score, mu = 20, alternative = "greater", conf.level = 0.9)
##
## One Sample t-test
##
## data: creativity$Score
## t = -0.12874, df = 23, p-value = 0.5507
## alternative hypothesis: true mean is greater than 20
## 90 percent confidence interval:
## 18.68762 Inf
## sample estimates:
## mean of x
## 19.88333
When you have two groups, the assumption is
\[ Y_{ig} \stackrel{ind}{\sim} N(\mu_g, \sigma_g^2) \] for \(i = 1, \ldots, n_g\) and \(g=1,2\). Alternatively
\[ Y_{ig} = \mu_g + \epsilon_{ig}, \qquad \epsilon_{ig} \stackrel{ind}{\sim} N(0,\sigma_g^2) \] which assumes
and
case0101 %>%
group_by(Treatment) %>%
summarize(
n = n(),
mean = mean(Score),
sd = sd(Score),
.groups = "drop"
)
## # A tibble: 2 × 4
## Treatment n mean sd
## <fct> <int> <dbl> <dbl>
## 1 Extrinsic 23 15.7 5.25
## 2 Intrinsic 24 19.9 4.44
Typically we are interested in the hypothesis \[ H_0: \mu_1 = \mu_2 \qquad \mbox{or, equivalently,} \qquad H_0: \mu_1-\mu_2 = 0 \] and a confidence interval for the difference \(\mu_1-\mu_2\).
To run the test, we first need to get the appropriate data
intrinsic <- case0101$Score[case0101$Treatment == "Intrinsic"]
extrinsic <- case0101$Score[case0101$Treatment == "Extrinsic"]
then we can run the test
t.test(extrinsic, intrinsic)
##
## Welch Two Sample t-test
##
## data: extrinsic and intrinsic
## t = -2.9153, df = 43.108, p-value = 0.005618
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.010803 -1.277603
## sample estimates:
## mean of x mean of y
## 15.73913 19.88333
Alternatively, we can use the formula version and simplify our code
t.test(Score ~ Treatment, data = case0101)
##
## Welch Two Sample t-test
##
## data: Score by Treatment
## t = -2.9153, df = 43.108, p-value = 0.005618
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
## -7.010803 -1.277603
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic
## 15.73913 19.88333
Often, we will make the assumption that the two groups have equal variances.
t.test(Score ~ Treatment, data = case0101, var.equal = TRUE)
##
## Two Sample t-test
##
## data: Score by Treatment
## t = -2.9259, df = 45, p-value = 0.005366
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
## -6.996973 -1.291432
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic
## 15.73913 19.88333
t.test(Score ~ Treatment, data = case0101, var.equal = TRUE,
mu = 5, alternative = "greater", conf.level = 0.9)
##
## Two Sample t-test
##
## data: Score by Treatment
## t = -6.456, df = 45, p-value = 1
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is greater than 5
## 90 percent confidence interval:
## -5.986439 Inf
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic
## 15.73913 19.88333
A paired analysis can be used when the data are have a natural grouping structure into a pair of observations. Within each pair of observations, one observation has one level of the explanatory variable while the other observation has the other level of the explanatory variable.
In the following example, the data are naturally paired because each set of observations is on a set of identical twins. One twin has schizophrenia while the other does not. These data compares the volume of the left hippocampus to understand its relationship to schizophrenia.
summary(case0202)
## Unaffected Affected
## Min. :1.250 Min. :1.02
## 1st Qu.:1.600 1st Qu.:1.31
## Median :1.770 Median :1.59
## Mean :1.759 Mean :1.56
## 3rd Qu.:1.935 3rd Qu.:1.78
## Max. :2.080 Max. :2.02
schizophrenia <- case0202 %>%
mutate(pair = LETTERS[1:n()]) %>%
pivot_longer(-pair, names_to = "diagnosis", values_to = "volume")
ggplot(schizophrenia, aes(x = diagnosis, y = volume, group = pair)) +
geom_line()
It is important to connect these points between each twin because we are looking for a pattern amongst the comparisons.
When the data are paired, we can perform a paired t-test.
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)
##
## Paired t-test
##
## data: case0202$Unaffected and case0202$Affected
## t = 3.2289, df = 14, p-value = 0.006062
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.0667041 0.3306292
## sample estimates:
## mean difference
## 0.1986667
This analysis is equivalent to taking the difference in each pair and performing a one-sample t-test.
t.test(case0202$Unaffected - case0202$Affected)
##
## One Sample t-test
##
## data: case0202$Unaffected - case0202$Affected
## t = 3.2289, df = 14, p-value = 0.006062
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.0667041 0.3306292
## sample estimates:
## mean of x
## 0.1986667
If we have a single categorical variable with more than two groups, then \(G > 2\).
case0501 %>%
group_by(Diet) %>%
summarize(
n = n(),
mean = mean(Lifetime),
sd = sd(Lifetime),
.groups = "drop"
)
## # A tibble: 6 × 4
## Diet n mean sd
## <fct> <int> <dbl> <dbl>
## 1 N/N85 57 32.7 5.13
## 2 N/R40 60 45.1 6.70
## 3 N/R50 71 42.3 7.77
## 4 NP 49 27.4 6.13
## 5 R/R50 56 42.9 6.68
## 6 lopro 56 39.7 6.99
ggplot(case0501, aes(x = Diet, y = Lifetime)) +
geom_jitter(width = 0.1)
An ANOVA F-test considers the following null hypothesis \[ H_0: \mu_g = \mu \quad \forall\, g \]
m <- lm(Lifetime ~ Diet, data = case0501)
anova(m)
## Analysis of Variance Table
##
## Response: Lifetime
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 5 12734 2546.8 57.104 < 2.2e-16 ***
## Residuals 343 15297 44.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As a general approach, regression allows the response variable to depend on categorical and continuous variables in complex patterns. For the current discussion, we will limit our discussion to one categorical variable.
The general structure of a regression model is \[ Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1X_{i,1} + \cdots + \beta_p X_{i,p}, \sigma^2) \] for \(i=1,\ldots,n\) or, equivalently, \[ Y_i = \beta_0 + \beta_1X_{i,1} + \cdots + \beta_p X_{i,p}, \qquad \epsilon_i \stackrel{ind}{\sim} N(0, \sigma^2). \] The assumptions a regression model are
and
We can repeat the equal variance analysis using regression by encoding a dummy or (indicator) variable to represent the group. So let \[ X_{i,1} = \left\{ \begin{array}{ll} 1 & \mbox{if Treatment}_i \mbox{ is Intrinsic} \\ 0 & \mbox{otherwise} \end{array} \right. \]
m <- lm(Score ~ Treatment, data = case0101)
We can summarize the analysis in a few different ways.
summary(m)
##
## Call:
## lm(formula = Score ~ Treatment, data = case0101)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.739 -2.983 1.061 2.961 9.817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.739 1.012 15.550 < 2e-16 ***
## TreatmentIntrinsic 4.144 1.416 2.926 0.00537 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.854 on 45 degrees of freedom
## Multiple R-squared: 0.1598, Adjusted R-squared: 0.1412
## F-statistic: 8.561 on 1 and 45 DF, p-value: 0.005366
confint(m)
## 2.5 % 97.5 %
## (Intercept) 13.700570 17.777691
## TreatmentIntrinsic 1.291432 6.996973
anova(m)
## Analysis of Variance Table
##
## Response: Score
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 1 201.71 201.708 8.5608 0.005366 **
## Residuals 45 1060.29 23.562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we have a single categorical variable with more than two groups, we can still utilize a regression model. To fit this regression model, we
Generally, we let R do this for us where the reference level is the first level of the factor variable or first alphabetically for a character variable.
m <- lm(Lifetime ~ Diet, data = case0501)
summary(m)
##
## Call:
## lm(formula = Lifetime ~ Diet, data = case0501)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5167 -3.3857 0.8143 5.1833 10.0143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.6912 0.8846 36.958 < 2e-16 ***
## DietN/R40 12.4254 1.2352 10.059 < 2e-16 ***
## DietN/R50 9.6060 1.1877 8.088 1.06e-14 ***
## DietNP -5.2892 1.3010 -4.065 5.95e-05 ***
## DietR/R50 10.1945 1.2565 8.113 8.88e-15 ***
## Dietlopro 6.9945 1.2565 5.567 5.25e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared: 0.4543, Adjusted R-squared: 0.4463
## F-statistic: 57.1 on 5 and 343 DF, p-value: < 2.2e-16
When we have two categorical variables we can utilize regression by
For simplicity, let’s start with the paired analysis.
m <- lm(volume ~ pair + diagnosis, schizophrenia)
summary(m)
##
## Call:
## lm(formula = volume ~ pair + diagnosis, data = schizophrenia)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23567 -0.07558 0.00000 0.07558 0.23567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.506e+00 1.231e-01 12.236 7.28e-09 ***
## pairB -7.000e-02 1.685e-01 -0.415 0.68412
## pairC -9.000e-02 1.685e-01 -0.534 0.60163
## pairD -1.200e-01 1.685e-01 -0.712 0.48806
## pairE 3.900e-01 1.685e-01 2.315 0.03633 *
## pairF -1.450e-01 1.685e-01 -0.861 0.40399
## pairG 1.250e-01 1.685e-01 0.742 0.47044
## pairH 1.150e-01 1.685e-01 0.682 0.50606
## pairI -7.500e-02 1.685e-01 -0.445 0.66305
## pairJ 2.800e-01 1.685e-01 1.662 0.11879
## pairK -4.700e-01 1.685e-01 -2.789 0.01448 *
## pairL 3.000e-02 1.685e-01 0.178 0.86124
## pairM 4.250e-01 1.685e-01 2.522 0.02439 *
## pairN 1.310e-15 1.685e-01 0.000 1.00000
## pairO 4.200e-01 1.685e-01 2.493 0.02583 *
## diagnosisUnaffected 1.987e-01 6.153e-02 3.229 0.00606 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1685 on 14 degrees of freedom
## Multiple R-squared: 0.8336, Adjusted R-squared: 0.6554
## F-statistic: 4.677 on 15 and 14 DF, p-value: 0.003132
The diagnosis line matches the paired analysis we did earlier:
t.test(case0202$Unaffected, case0202$Affected, paired = TRUE)$conf.int
## [1] 0.0667041 0.3306292
## attr(,"conf.level")
## [1] 0.95
confint(m)[16,]
## 2.5 % 97.5 %
## 0.0667041 0.3306292
case1301 %>%
group_by(Block, Treat) %>%
summarize(
n = n(),
mean = mean(Cover),
sd = sd(Cover),
.groups = "drop"
)
## # A tibble: 48 × 5
## Block Treat n mean sd
## <fct> <fct> <int> <dbl> <dbl>
## 1 B1 C 2 18.5 6.36
## 2 B1 L 2 4 0
## 3 B1 Lf 2 4 1.41
## 4 B1 LfF 2 1.5 0.707
## 5 B1 f 2 17.5 9.19
## 6 B1 fF 2 11.5 2.12
## 7 B2 C 2 28.5 9.19
## 8 B2 L 2 7.5 0.707
## 9 B2 Lf 2 4.5 2.12
## 10 B2 LfF 2 4 1.41
## # … with 38 more rows
While graphics have always been helpful, they now become even more helpful. We may need to play around with the plots to get one we like.
ggplot(case1301, aes(x = Block, y = Cover, shape = Treat, color = Treat)) +
geom_point()
ggplot(case1301, aes(x = Treat, y = Cover, shape = Treat, color = Treat)) +
geom_point() +
facet_wrap(~Block, ncol = 4)
m <- lm(Cover ~ Block + Treat, data = case1301)
anova(m) # Two-way ANOVA
## Analysis of Variance Table
##
## Response: Cover
## Df Sum Sq Mean Sq F value Pr(>F)
## Block 7 19106 2729.4 20.780 6.977e-16 ***
## Treat 5 23046 4609.1 35.092 < 2.2e-16 ***
## Residuals 83 10902 131.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m)
##
## Call:
## lm(formula = Cover ~ Block + Treat, data = case1301)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.375 -5.812 0.625 5.438 26.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.875 4.217 7.795 1.66e-11 ***
## BlockB2 3.750 4.679 0.801 0.425132
## BlockB3 31.750 4.679 6.786 1.59e-09 ***
## BlockB4 45.500 4.679 9.725 2.32e-15 ***
## BlockB5 14.750 4.679 3.153 0.002253 **
## BlockB6 27.750 4.679 5.931 6.67e-08 ***
## BlockB7 13.500 4.679 2.885 0.004980 **
## BlockB8 16.000 4.679 3.420 0.000974 ***
## TreatL -32.750 4.052 -8.083 4.45e-12 ***
## TreatLf -35.500 4.052 -8.761 1.96e-13 ***
## TreatLfF -44.250 4.052 -10.921 < 2e-16 ***
## Treatf -9.250 4.052 -2.283 0.024995 *
## TreatfF -18.500 4.052 -4.566 1.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.46 on 83 degrees of freedom
## Multiple R-squared: 0.7945, Adjusted R-squared: 0.7648
## F-statistic: 26.74 on 12 and 83 DF, p-value: < 2.2e-16