The code below uses the Sleuth3 package. To install it, use
install.packages("Sleuth3")
Installation only needs to be done once, but the package must be loaded with library() in every R session that uses it. Here we load tidyverse as well, since it provides the plotting functions used later.
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("Sleuth3")
options(scipen = 100) # eliminate scientific notation
Simple linear regression involves two variables: an independent (explanatory) variable and a dependent (response) variable. In this example, we use Edwin Hubble's data on the relationship between Velocity and Distance for 24 nebulae outside the Milky Way. We treat Distance as the dependent variable and recession Velocity as the independent variable, so the model describes how the mean Distance changes with Velocity.
The first step is to take a look at the data:
head(case0701)
## Velocity Distance
## 1 170 0.03
## 2 290 0.03
## 3 -130 0.21
## 4 -70 0.26
## 5 -185 0.28
## 6 -220 0.28
summary(case0701)
## Velocity Distance
## Min. :-220.0 Min. :0.0300
## 1st Qu.: 165.0 1st Qu.:0.4075
## Median : 295.0 Median :0.9000
## Mean : 373.1 Mean :0.9113
## 3rd Qu.: 537.5 3rd Qu.:1.1750
## Max. :1090.0 Max. :2.0000
ggplot(case0701, aes(x = Velocity, y = Distance)) +
  geom_point()
To add the regression line with uncertainty, we use
ggplot(case0701, aes(x = Velocity, y = Distance)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula 'y ~ x'
From the help file for these data (?case0701), we find that Velocity is measured in kilometers per second (km/s) and Distance is measured in megaparsecs (Mpc).
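Since the units are known, they can be put on the axes. The following is a minimal sketch rather than part of the original analysis: the object name `p` is an assumption, and `formula = y ~ x` is passed explicitly only to silence the message ggplot2 prints when it chooses the formula itself.

```r
library("ggplot2")
library("Sleuth3")  # provides the case0701 data

# Same scatterplot and fitted line as before, with units in the axis labels
p <- ggplot(case0701, aes(x = Velocity, y = Distance)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Velocity (km/s)", y = "Distance (Mpc)")

p
```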
Recall that the simple linear regression model is \[ Y_i \stackrel{ind}{\sim} N(\beta_0+\beta_1X_i, \sigma^2). \] In this example, \(X_i\) and \(Y_i\) are the Velocity and Distance, respectively, for nebula \(i\).
To run the regression, we use
m <- lm(Distance ~ Velocity, data = case0701)
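As a sketch of what lm() computes in the simple linear regression case, the least-squares estimates can also be formed by hand from the sample covariance, variance, and means. The variable names `x`, `y`, `b0`, and `b1` are illustrative assumptions, not part of the original analysis.

```r
library("Sleuth3")  # provides the case0701 data

x <- case0701$Velocity
y <- case0701$Distance

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)       # slope
b0 <- mean(y) - b1 * mean(x)   # intercept

c(intercept = b0, slope = b1)

# Compare with lm(); the two should agree up to floating-point error
m <- lm(Distance ~ Velocity, data = case0701)
all.equal(unname(coef(m)), c(b0, b1))
```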
A summary of much of the regression analysis can be obtained using the summary() function.
summary(m)
##
## Call:
## lm(formula = Distance ~ Velocity, data = case0701)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76717 -0.23517 -0.01083 0.21081 0.91463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3991704 0.1186662 3.364 0.0028 **
## Velocity 0.0013724 0.0002278 6.024 0.00000461 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4056 on 22 degrees of freedom
## Multiple R-squared: 0.6226, Adjusted R-squared: 0.6054
## F-statistic: 36.29 on 1 and 22 DF, p-value: 0.000004608
To look at only the estimated coefficients, use coef().
coef(m)
## (Intercept) Velocity
## 0.399170440 0.001372408
Here we have \(\hat\beta_0 \approx 0.4\) and \(\hat\beta_1 \approx 0.0014\). We can obtain 95% confidence intervals for these two quantities using
confint(m)
## 2.5 % 97.5 %
## (Intercept) 0.1530719058 0.64526897
## Velocity 0.0008999349 0.00184488
The interpretations of these parameters are:
- For a nebula with a velocity of 0 km/s, the mean distance from Earth is 0.4 Mpc (95% interval: 0.15, 0.65).
- For each 1 km/s increase in velocity, the mean distance from Earth increases by 0.0014 Mpc (95% interval: 0.0009, 0.0018).
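The interval endpoints quoted above can be reproduced from the coefficient table as estimate ± t critical value × standard error, using the residual degrees of freedom. This is a minimal sketch; the names `est`, `se`, `tcrit`, and `ci` are assumptions introduced for illustration.

```r
library("Sleuth3")  # provides the case0701 data

m <- lm(Distance ~ Velocity, data = case0701)

# Estimates and standard errors from the coefficient table
est <- coef(summary(m))[, "Estimate"]
se  <- coef(summary(m))[, "Std. Error"]

# 97.5th percentile of the t distribution with n - 2 degrees of freedom
tcrit <- qt(0.975, df = df.residual(m))

ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci

# Should match confint(m) up to floating-point error
all.equal(unname(ci), unname(confint(m)))
```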
The summary output also provides an estimate of the error standard deviation \(\sigma\), labeled "Residual standard error". In this example, it is estimated to be
summary(m)$sigma
## [1] 0.4056302
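This estimate is the square root of the residual sum of squares divided by the residual degrees of freedom. A short sketch of that calculation, with `sse` and `sigma_hat` as illustrative names:

```r
library("Sleuth3")  # provides the case0701 data

m <- lm(Distance ~ Velocity, data = case0701)

sse <- sum(residuals(m)^2)               # residual sum of squares
sigma_hat <- sqrt(sse / df.residual(m))  # divide by n - 2, then take the root
sigma_hat

# Should match the value stored in the summary object
all.equal(sigma_hat, summary(m)$sigma)
```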
The coefficient of determination, \(R^2\), is interpreted as the proportion of variability in the dependent variable explained by this model. For these data, the coefficient of determination is
summary(m)$r.squared
## [1] 0.6225715
Often this value is multiplied by 100 to represent the results on a percentage basis.
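To make the "proportion of variability explained" reading concrete, \(R^2\) can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The names `sse`, `sst`, and `r2` below are assumptions used only for this sketch.

```r
library("Sleuth3")  # provides the case0701 data

m <- lm(Distance ~ Velocity, data = case0701)

sse <- sum(residuals(m)^2)                                   # unexplained variability
sst <- sum((case0701$Distance - mean(case0701$Distance))^2)  # total variability
r2  <- 1 - sse / sst

r2
100 * r2  # the same quantity on a percentage basis

# Should match the value stored in the summary object
all.equal(r2, summary(m)$r.squared)
```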
Related to the coefficient of determination is the correlation. In simple linear regression, the correlation is the square root of \(R^2\), taken with the same sign as the estimated slope \(\hat\beta_1\). We can find the correlation directly using
cor(case0701$Velocity, case0701$Distance)
## [1] 0.789032
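The relationship between the correlation and \(R^2\) can be checked numerically. This sketch assumes the fitted model is stored in `m` as earlier; `r_from_r2` and `r_direct` are hypothetical names.

```r
library("Sleuth3")  # provides the case0701 data

m <- lm(Distance ~ Velocity, data = case0701)

# Square root of R^2, signed to match the estimated slope
r_from_r2 <- sign(coef(m)[["Velocity"]]) * sqrt(summary(m)$r.squared)

# Correlation computed directly from the data
r_direct <- cor(case0701$Velocity, case0701$Distance)

c(from_r2 = r_from_r2, direct = r_direct)
all.equal(r_from_r2, r_direct)
```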