To follow along, use the lab09 code and make sure the following packages are installed:

You can use the following code to perform the installation:

install.packages("GGally")

Now load packages

library("tidyverse")
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5          v purrr   0.3.4     
## v tibble  3.1.2          v dplyr   1.0.7     
## v tidyr   1.1.3          v stringr 1.4.0.9000
## v readr   1.4.0          v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("GGally")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library("Sleuth3")

Higher order terms

Consider the case1001 data set.

ggplot(case1001, aes(Height, Distance)) +
  geom_point() + theme_bw()

There are a number of ways to run the cubic model.

# First create the variables yourself
case1001_tmp <- case1001 %>%
  mutate(Height2 = Height * Height,
         Height3 = Height * Height2)

# Then run the regression
m1 <- lm(Distance ~ Height + Height2 + Height3, data = case1001_tmp)

I don’t prefer this method because

new <- data.frame(Height = 500) %>%
  mutate(Height2 = Height * Height,  # Need to manually create the
         Height3 = Height * Height2) # quadratic and cubic terms
predict(m1, new)
##        1 
## 470.6527

An alternative approach is to use a formula to construct the higher order terms. e.g.

m2 <- lm(Distance ~ Height + I(Height^2) + I(Height^3), data = case1001) # or
m3 <- lm(Distance ~ poly(Height, 3, raw = TRUE),        data = case1001)

Now, your original data.frame is unchanged and your new data.frame is simpler.

new <- data.frame(Height = 500)
predict(m2, new) # or
##        1 
## 470.6527
predict(m3, new)
##        1 
## 470.6527

Additional explanatory variables

Run the following code to construct the longnosedace data

# From http://www.biostathandbook.com/multipleregression.html
# Modified from http://rcompanion.org/rcompanion/e_05.html
Input = ("
stream                   count acreage  dO2   maxdepth  no3   so4     temp
BASIN_RUN                  13         2528    9.6  80        2.28  16.75   15.3
BEAR_BR                    12         3333    8.5  83        5.34   7.74   19.4
BEAR_CR                    54        19611    8.3  96        0.99  10.92   19.5
BEAVER_DAM_CR              19         3570    9.2  56        5.44  16.53   17
BEAVER_RUN                 37         1722    8.1  43        5.66   5.91   19.3
BENNETT_CR                  2          583    9.2  51        2.26   8.81   12.9
BIG_BR                     72         4790    9.4  91        4.1    5.65   16.7
BIG_ELK_CR                164        35971   10.2  81        3.2   17.53   13.8
BIG_PIPE_CR                18        25440    7.5  120       3.53   8.2    13.7
BLUE_LICK_RUN               1         2217    8.5  46        1.2   10.85   14.3
BROAD_RUN                  53         1971   11.9  56        3.25  11.12   22.2
BUFFALO_RUN                16        12620    8.3  37        0.61  18.87   16.8
BUSH_CR                    32        19046    8.3  120       2.93  11.31   18
CABIN_JOHN_CR              21         8612    8.2  103       1.57  16.09   15
CARROLL_BR                 23         3896   10.4  105       2.77  12.79   18.4
COLLIER_RUN                18         6298    8.6  42        0.26  17.63   18.2
CONOWINGO_CR              112        27350    8.5  65        6.95  14.94   24.1
DEAD_RUN                   25         4145    8.7  51        0.34  44.93   23
DEEP_RUN                    5         1175    7.7  57        1.3   21.68   21.8
DEER_CR                    26         8297    9.9  60        5.26  6.36    19.1
DORSEY_RUN                  8         7814    6.8  160       0.44  20.24   22.6
FALLS_RUN                  15         1745    9.4  48        2.19  10.27   14.3
FISHING_CR                 11         5046    7.6  109       0.73   7.1    19
FLINTSTONE_CR              11        18943    9.2  50        0.25  14.21   18.5
GREAT_SENECA_CR            87         8624    8.6  78        3.37   7.51   21.3
GREENE_BR                  33         2225    9.1  41        2.3    9.72   20.5
GUNPOWDER_FALLS            22        12659    9.7  65        3.3    5.98   18
HAINES_BR                  98         1967    8.6  50        7.71  26.44   16.8
HAWLINGS_R                  1         1172    8.3  73        2.62   4.64   20.5
HAY_MEADOW_BR               5          639    9.5  26        3.53   4.46   20.1
HERRINGTON_RUN              1         7056    6.4  60        0.25   9.82   24.5
HOLLANDS_BR                38         1934   10.5  85        2.34  11.44   12
ISRAEL_CR                  30         6260    9.5  133       2.41  13.77   21
LIBERTY_RES                12          424    8.3  62        3.49   5.82   20.2
LITTLE_ANTIETAM_CR         24         3488    9.3  44        2.11  13.37   24
LITTLE_BEAR_CR              6         3330    9.1  67        0.81   8.16   14.9
LITTLE_CONOCOCHEAGUE_CR    15         2227    6.8  54        0.33   7.6    24
LITTLE_DEER_CR             38         8115    9.6  110       3.4    9.22   20.5
LITTLE_FALLS               84         1600   10.2  56        3.54   5.69   19.5
LITTLE_GUNPOWDER_R          3        15305    9.7  85        2.6    6.96   17.5
LITTLE_HUNTING_CR          18         7121    9.5  58        0.51   7.41   16
LITTLE_PAINT_BR            63         5794    9.4  34        1.19  12.27   17.5
MAINSTEM_PATUXENT_R       239         8636    8.4  150       3.31   5.95   18.1
MEADOW_BR                 234         4803    8.5  93        5.01  10.98   24.3
MILL_CR                     6         1097    8.3  53        1.71  15.77   13.1
MORGAN_RUN                 76         9765    9.3  130       4.38   5.74   16.9
MUDDY_BR                   25         4266    8.9  68        2.05  12.77   17
MUDLICK_RUN                 8         1507    7.4  51        0.84  16.3    21
NORTH_BR                   23         3836    8.3  121       1.32   7.36   18.5
NORTH_BR_CASSELMAN_R       16        17419    7.4  48        0.29   2.5    18
NORTHWEST_BR                6         8735    8.2  63        1.56  13.22   20.8
NORTHWEST_BR_ANACOSTIA_R  100        22550    8.4  107       1.41  14.45   23
OWENS_CR                   80         9961    8.6  79        1.02   9.07   21.8
PATAPSCO_R                 28         4706    8.9  61        4.06   9.9    19.7
PINEY_BR                   48         4011    8.3  52        4.7    5.38   18.9
PINEY_CR                   18         6949    9.3  100       4.57  17.84   18.6
PINEY_RUN                  36        11405    9.2  70        2.17  10.17   23.6
PRETTYBOY_BR               19          904    9.8  39        6.81   9.2    19.2
RED_RUN                    32         3332    8.4  73        2.09   5.5    17.7
ROCK_CR                     3          575    6.8  33        2.47   7.61   18
SAVAGE_R                  106        29708    7.7  73        0.63  12.28   21.4
SECOND_MINE_BR             62         2511   10.2  60        4.17  10.75   17.7
SENECA_CR                  23        18422    9.9  45        1.58   8.37   20.1
SOUTH_BR_CASSELMAN_R        2         6311    7.6  46        0.64  21.16   18.5
SOUTH_BR_PATAPSCO          26         1450    7.9  60        2.96   8.84   18.6
SOUTH_FORK_LINGANORE_CR    20         4106   10.0  96        2.62   5.45   15.4
TUSCARORA_CR               38        10274    9.3  90        5.45  24.76   15
WATTS_BR                   19          510    6.7  82        5.25  14.19   26.5
")

longnosedace = read.table(textConnection(Input),header=TRUE)

Take a quick look at the data

head(longnosedace)
##          stream count acreage dO2 maxdepth  no3   so4 temp
## 1     BASIN_RUN    13    2528 9.6       80 2.28 16.75 15.3
## 2       BEAR_BR    12    3333 8.5       83 5.34  7.74 19.4
## 3       BEAR_CR    54   19611 8.3       96 0.99 10.92 19.5
## 4 BEAVER_DAM_CR    19    3570 9.2       56 5.44 16.53 17.0
## 5    BEAVER_RUN    37    1722 8.1       43 5.66  5.91 19.3
## 6    BENNETT_CR     2     583 9.2       51 2.26  8.81 12.9
summary(longnosedace)
##     stream              count           acreage           dO2        
##  Length:68          Min.   :  1.00   Min.   :  424   Min.   : 6.400  
##  Class :character   1st Qu.: 12.00   1st Qu.: 2156   1st Qu.: 8.300  
##  Mode  :character   Median : 23.00   Median : 4748   Median : 8.600  
##                     Mean   : 38.81   Mean   : 7565   Mean   : 8.762  
##                     3rd Qu.: 40.50   3rd Qu.: 8992   3rd Qu.: 9.425  
##                     Max.   :239.00   Max.   :35971   Max.   :11.900  
##     maxdepth           no3             so4              temp      
##  Min.   : 26.00   Min.   :0.250   Min.   : 2.500   Min.   :12.00  
##  1st Qu.: 51.00   1st Qu.:1.198   1st Qu.: 7.397   1st Qu.:16.98  
##  Median : 64.00   Median :2.375   Median :10.220   Median :18.60  
##  Mean   : 72.56   Mean   :2.672   Mean   :11.650   Mean   :18.87  
##  3rd Qu.: 90.25   3rd Qu.:3.533   3rd Qu.:14.270   3rd Qu.:20.85  
##  Max.   :160.00   Max.   :7.710   Max.   :44.930   Max.   :26.50
ggpairs(longnosedace %>% select(-stream),
        progress = interactive()) +
  theme_bw()

Adding additional explanatory variables is accomplished by separating the explanatory variables by +.

m <- lm(sqrt(count) ~ no3 + maxdepth, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + maxdepth, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3896 -1.9651 -0.5409  1.3018  7.8564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1.5683     1.0285   1.525  0.13217   
## no3           0.6005     0.1882   3.191  0.00218 **
## maxdepth      0.0308     0.0117   2.633  0.01057 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.804 on 65 degrees of freedom
## Multiple R-squared:  0.2145, Adjusted R-squared:  0.1903 
## F-statistic: 8.874 on 2 and 65 DF,  p-value: 0.0003914

You can run a regression with all variables using .

m <- lm(sqrt(count) ~ ., data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ ., data = longnosedace)
## 
## Residuals:
## ALL 68 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (6 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      3.6056         NA      NA       NA
## streamBEAR_BR                   -0.1414         NA      NA       NA
## streamBEAR_CR                    3.7429         NA      NA       NA
## streamBEAVER_DAM_CR              0.7533         NA      NA       NA
## streamBEAVER_RUN                 2.4772         NA      NA       NA
## streamBENNETT_CR                -2.1913         NA      NA       NA
## streamBIG_BR                     4.8797         NA      NA       NA
## streamBIG_ELK_CR                 9.2007         NA      NA       NA
## streamBIG_PIPE_CR                0.6371         NA      NA       NA
## streamBLUE_LICK_RUN             -2.6056         NA      NA       NA
## streamBROAD_RUN                  3.6746         NA      NA       NA
## streamBUFFALO_RUN                0.3944         NA      NA       NA
## streamBUSH_CR                    2.0513         NA      NA       NA
## streamCABIN_JOHN_CR              0.9770         NA      NA       NA
## streamCARROLL_BR                 1.1903         NA      NA       NA
## streamCOLLIER_RUN                0.6371         NA      NA       NA
## streamCONOWINGO_CR               6.9775         NA      NA       NA
## streamDEAD_RUN                   1.3944         NA      NA       NA
## streamDEEP_RUN                  -1.3695         NA      NA       NA
## streamDEER_CR                    1.4935         NA      NA       NA
## streamDORSEY_RUN                -0.7771         NA      NA       NA
## streamFALLS_RUN                  0.2674         NA      NA       NA
## streamFISHING_CR                -0.2889         NA      NA       NA
## streamFLINTSTONE_CR             -0.2889         NA      NA       NA
## streamGREAT_SENECA_CR            5.7218         NA      NA       NA
## streamGREENE_BR                  2.1390         NA      NA       NA
## streamGUNPOWDER_FALLS            1.0849         NA      NA       NA
## streamHAINES_BR                  6.2939         NA      NA       NA
## streamHAWLINGS_R                -2.6056         NA      NA       NA
## streamHAY_MEADOW_BR             -1.3695         NA      NA       NA
## streamHERRINGTON_RUN            -2.6056         NA      NA       NA
## streamHOLLANDS_BR                2.5589         NA      NA       NA
## streamISRAEL_CR                  1.8717         NA      NA       NA
## streamLIBERTY_RES               -0.1414         NA      NA       NA
## streamLITTLE_ANTIETAM_CR         1.2934         NA      NA       NA
## streamLITTLE_BEAR_CR            -1.1561         NA      NA       NA
## streamLITTLE_CONOCOCHEAGUE_CR    0.2674         NA      NA       NA
## streamLITTLE_DEER_CR             2.5589         NA      NA       NA
## streamLITTLE_FALLS               5.5596         NA      NA       NA
## streamLITTLE_GUNPOWDER_R        -1.8735         NA      NA       NA
## streamLITTLE_HUNTING_CR          0.6371         NA      NA       NA
## streamLITTLE_PAINT_BR            4.3317         NA      NA       NA
## streamMAINSTEM_PATUXENT_R       11.8541         NA      NA       NA
## streamMEADOW_BR                 11.6915         NA      NA       NA
## streamMILL_CR                   -1.1561         NA      NA       NA
## streamMORGAN_RUN                 5.1122         NA      NA       NA
## streamMUDDY_BR                   1.3944         NA      NA       NA
## streamMUDLICK_RUN               -0.7771         NA      NA       NA
## streamNORTH_BR                   1.1903         NA      NA       NA
## streamNORTH_BR_CASSELMAN_R       0.3944         NA      NA       NA
## streamNORTHWEST_BR              -1.1561         NA      NA       NA
## streamNORTHWEST_BR_ANACOSTIA_R   6.3944         NA      NA       NA
## streamOWENS_CR                   5.3387         NA      NA       NA
## streamPATAPSCO_R                 1.6860         NA      NA       NA
## streamPINEY_BR                   3.3227         NA      NA       NA
## streamPINEY_CR                   0.6371         NA      NA       NA
## streamPINEY_RUN                  2.3944         NA      NA       NA
## streamPRETTYBOY_BR               0.7533         NA      NA       NA
## streamRED_RUN                    2.0513         NA      NA       NA
## streamROCK_CR                   -1.8735         NA      NA       NA
## streamSAVAGE_R                   6.6901         NA      NA       NA
## streamSECOND_MINE_BR             4.2685         NA      NA       NA
## streamSENECA_CR                  1.1903         NA      NA       NA
## streamSOUTH_BR_CASSELMAN_R      -2.1913         NA      NA       NA
## streamSOUTH_BR_PATAPSCO          1.4935         NA      NA       NA
## streamSOUTH_FORK_LINGANORE_CR    0.8666         NA      NA       NA
## streamTUSCARORA_CR               2.5589         NA      NA       NA
## streamWATTS_BR                   0.7533         NA      NA       NA
## acreage                              NA         NA      NA       NA
## dO2                                  NA         NA      NA       NA
## maxdepth                             NA         NA      NA       NA
## no3                                  NA         NA      NA       NA
## so4                                  NA         NA      NA       NA
## temp                                 NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 67 and 0 DF,  p-value: NA

But you need to make sure you want to include all the variables. In this case, we probably wanted to exclude the stream variable

m <- lm(sqrt(count) ~ ., data = longnosedace %>% select(-stream))
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ ., data = longnosedace %>% select(-stream))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3635 -1.9071 -0.3538  1.3078  8.2489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.259e+00  4.084e+00  -2.267  0.02694 * 
## acreage      1.428e-04  4.153e-05   3.440  0.00106 **
## dO2          7.157e-01  3.311e-01   2.162  0.03456 * 
## maxdepth     2.277e-02  1.097e-02   2.075  0.04223 * 
## no3          5.580e-01  1.787e-01   3.123  0.00274 **
## so4          7.571e-03  4.756e-02   0.159  0.87406   
## temp         2.165e-01  1.042e-01   2.077  0.04198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.524 on 61 degrees of freedom
## Multiple R-squared:  0.4026, Adjusted R-squared:  0.3438 
## F-statistic: 6.852 on 6 and 61 DF,  p-value: 1.367e-05

or

m <- lm(sqrt(count) ~ .-stream, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ . - stream, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3635 -1.9071 -0.3538  1.3078  8.2489 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.259e+00  4.084e+00  -2.267  0.02694 * 
## acreage      1.428e-04  4.153e-05   3.440  0.00106 **
## dO2          7.157e-01  3.311e-01   2.162  0.03456 * 
## maxdepth     2.277e-02  1.097e-02   2.075  0.04223 * 
## no3          5.580e-01  1.787e-01   3.123  0.00274 **
## so4          7.571e-03  4.756e-02   0.159  0.87406   
## temp         2.165e-01  1.042e-01   2.077  0.04198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.524 on 61 degrees of freedom
## Multiple R-squared:  0.4026, Adjusted R-squared:  0.3438 
## F-statistic: 6.852 on 6 and 61 DF,  p-value: 1.367e-05

You can also use this technique to eliminate the intercept and perform “regression through the origin”.

m <- lm(sqrt(count) ~ no3 + acreage - 1, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + acreage - 1, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9745 -1.3463  0.5319  2.3545  9.5786 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## no3     1.143e+00  1.273e-01   8.976 4.80e-13 ***
## acreage 2.431e-04  3.806e-05   6.386 1.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.849 on 66 degrees of freedom
## Multiple R-squared:  0.797,  Adjusted R-squared:  0.7909 
## F-statistic: 129.6 on 2 and 66 DF,  p-value: < 2.2e-16

This is not a common model but there may be a time when you want to force the intercept to be 0.

Interactions

Interactions can be added to a model by using a colon (:) between two explanatory variables, e.g.

m1 <- lm(sqrt(count) ~ no3 + maxdepth + no3:maxdepth, data = longnosedace)
summary(m1)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + maxdepth + no3:maxdepth, data = longnosedace)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.369 -1.848 -0.567  1.251  7.213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.017978   1.532259   1.970   0.0532 .
## no3          -0.006920   0.513095  -0.013   0.9893  
## maxdepth      0.007891   0.021448   0.368   0.7142  
## no3:maxdepth  0.009373   0.007372   1.272   0.2081  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 64 degrees of freedom
## Multiple R-squared:  0.2338, Adjusted R-squared:  0.1979 
## F-statistic: 6.511 on 3 and 64 DF,  p-value: 0.0006502

But because we have a convention of including main effects (lower order terms) whenever an interaction (higher order term) is included in the model, R provides a convenient way of including both the main effects as well as the interaction by using an asterisk (*).

m2 <- lm(sqrt(count) ~ no3*maxdepth, data = longnosedace)
summary(m2)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 * maxdepth, data = longnosedace)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.369 -1.848 -0.567  1.251  7.213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.017978   1.532259   1.970   0.0532 .
## no3          -0.006920   0.513095  -0.013   0.9893  
## maxdepth      0.007891   0.021448   0.368   0.7142  
## no3:maxdepth  0.009373   0.007372   1.272   0.2081  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 64 degrees of freedom
## Multiple R-squared:  0.2338, Adjusted R-squared:  0.1979 
## F-statistic: 6.511 on 3 and 64 DF,  p-value: 0.0006502

Alternatively, you can use the caret symbol (^) and specifying what order interactions you want. For example ^2 will include all two-way interactions.

m3 <- lm(sqrt(count) ~ (no3 + maxdepth)^2, data = longnosedace)
summary(m3)
## 
## Call:
## lm(formula = sqrt(count) ~ (no3 + maxdepth)^2, data = longnosedace)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.369 -1.848 -0.567  1.251  7.213 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.017978   1.532259   1.970   0.0532 .
## no3          -0.006920   0.513095  -0.013   0.9893  
## maxdepth      0.007891   0.021448   0.368   0.7142  
## no3:maxdepth  0.009373   0.007372   1.272   0.2081  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 64 degrees of freedom
## Multiple R-squared:  0.2338, Adjusted R-squared:  0.1979 
## F-statistic: 6.511 on 3 and 64 DF,  p-value: 0.0006502

If you want to include more terms, you have a couple of options, e.g. 

m4 <- lm(sqrt(count) ~ (no3 + maxdepth + acreage)^2, data = longnosedace) # main effects and two-way interactions
summary(m4)
## 
## Call:
## lm(formula = sqrt(count) ~ (no3 + maxdepth + acreage)^2, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9715 -1.7287 -0.6365  1.3585  7.4738 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       9.771e-01  1.891e+00   0.517   0.6072  
## no3               3.596e-01  4.977e-01   0.723   0.4727  
## maxdepth          1.277e-02  2.575e-02   0.496   0.6218  
## acreage           2.893e-04  1.585e-04   1.825   0.0729 .
## no3:maxdepth      7.511e-03  7.100e-03   1.058   0.2943  
## no3:acreage      -2.060e-05  2.156e-05  -0.956   0.3431  
## maxdepth:acreage -1.200e-06  1.859e-06  -0.646   0.5210  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.613 on 61 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.2968 
## F-statistic: 5.714 on 6 and 61 DF,  p-value: 9.178e-05
m5 <- lm(sqrt(count) ~ (no3 + maxdepth + acreage)^3, data = longnosedace) # main effects, two-way interactions, and three-way interaction
summary(m5)
## 
## Call:
## lm(formula = sqrt(count) ~ (no3 + maxdepth + acreage)^3, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1123 -1.5007 -0.3734  1.1348  7.5601 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           4.540e+00  2.492e+00   1.822   0.0734 .
## no3                  -9.005e-01  7.668e-01  -1.174   0.2449  
## maxdepth             -4.593e-02  3.735e-02  -1.230   0.2236  
## acreage              -2.026e-04  2.787e-04  -0.727   0.4700  
## no3:maxdepth          2.856e-02  1.210e-02   2.361   0.0215 *
## no3:acreage           1.679e-04  9.141e-05   1.837   0.0712 .
## maxdepth:acreage      6.495e-06  4.057e-06   1.601   0.1147  
## no3:maxdepth:acreage -2.895e-06  1.366e-06  -2.119   0.0383 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 60 degrees of freedom
## Multiple R-squared:  0.4044, Adjusted R-squared:  0.3349 
## F-statistic: 5.819 on 7 and 60 DF,  p-value: 3.555e-05
m5 <- lm(sqrt(count) ~ no3 * maxdepth * acreage, data = longnosedace) # main effects, two-way interactions, and three-way interaction
summary(m5)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 * maxdepth * acreage, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1123 -1.5007 -0.3734  1.1348  7.5601 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           4.540e+00  2.492e+00   1.822   0.0734 .
## no3                  -9.005e-01  7.668e-01  -1.174   0.2449  
## maxdepth             -4.593e-02  3.735e-02  -1.230   0.2236  
## acreage              -2.026e-04  2.787e-04  -0.727   0.4700  
## no3:maxdepth          2.856e-02  1.210e-02   2.361   0.0215 *
## no3:acreage           1.679e-04  9.141e-05   1.837   0.0712 .
## maxdepth:acreage      6.495e-06  4.057e-06   1.601   0.1147  
## no3:maxdepth:acreage -2.895e-06  1.366e-06  -2.119   0.0383 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 60 degrees of freedom
## Multiple R-squared:  0.4044, Adjusted R-squared:  0.3349 
## F-statistic: 5.819 on 7 and 60 DF,  p-value: 3.555e-05

We can combine the . with this notation for higher order terms.

m <- lm(sqrt(count) ~ .^2, data = longnosedace %>% select(-stream))
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ .^2, data = longnosedace %>% select(-stream))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6901 -1.5280 -0.2438  1.4632  6.0108 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      -1.920e+01  2.715e+01  -0.707  0.48300   
## acreage          -5.081e-05  8.671e-04  -0.059  0.95352   
## dO2               2.193e+00  2.572e+00   0.853  0.39825   
## maxdepth          7.802e-02  1.856e-01   0.420  0.67626   
## no3               7.466e-01  4.068e+00   0.184  0.85520   
## so4              -2.789e-01  9.418e-01  -0.296  0.76846   
## temp              4.744e-01  1.062e+00   0.447  0.65715   
## acreage:dO2      -2.020e-05  6.624e-05  -0.305  0.76172   
## acreage:maxdepth  1.063e-06  2.399e-06   0.443  0.65993   
## acreage:no3      -1.860e-05  3.016e-05  -0.617  0.54057   
## acreage:so4       1.568e-05  1.315e-05   1.192  0.23931   
## acreage:temp      7.263e-06  1.697e-05   0.428  0.67071   
## dO2:maxdepth     -1.478e-02  1.478e-02  -1.000  0.32248   
## dO2:no3          -5.095e-02  2.867e-01  -0.178  0.85974   
## dO2:so4           1.021e-01  8.758e-02   1.166  0.24966   
## dO2:temp         -6.416e-02  1.034e-01  -0.621  0.53785   
## maxdepth:no3      6.118e-03  9.920e-03   0.617  0.54046   
## maxdepth:so4     -7.029e-03  2.448e-03  -2.872  0.00615 **
## maxdepth:temp     6.953e-03  6.417e-03   1.084  0.28422   
## no3:so4           6.959e-03  3.333e-02   0.209  0.83553   
## no3:temp         -7.861e-03  9.412e-02  -0.084  0.93380   
## so4:temp         -1.306e-02  2.748e-02  -0.475  0.63673   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.533 on 46 degrees of freedom
## Multiple R-squared:  0.5462, Adjusted R-squared:  0.3391 
## F-statistic: 2.637 on 21 and 46 DF,  p-value: 0.003049

Adding or removing variables

Another approach that might be useful is to construct a model and then add or remove variables from that model.

m <- lm(sqrt(count) ~ no3, data = longnosedace)
summary(m)
## 
## Call:
## lm(formula = sqrt(count) ~ no3, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3751 -1.9991 -0.1519  0.9806  9.6578 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7547     0.6335   5.927 1.24e-07 ***
## no3           0.6185     0.1963   3.150  0.00245 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.927 on 66 degrees of freedom
## Multiple R-squared:  0.1307, Adjusted R-squared:  0.1175 
## F-statistic: 9.924 on 1 and 66 DF,  p-value: 0.002452
mA <- update(m, ~ . + acreage)
summary(mA)
## 
## Call:
## lm(formula = sqrt(count) ~ no3 + acreage, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9154 -1.8807 -0.3148  1.3880  9.4343 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.3053896  0.6785796   3.397 0.001166 ** 
## no3         0.6890353  0.1782946   3.865 0.000259 ***
## acreage     0.0001667  0.0000419   3.978 0.000178 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.645 on 65 degrees of freedom
## Multiple R-squared:  0.3009, Adjusted R-squared:  0.2794 
## F-statistic: 13.99 on 2 and 65 DF,  p-value: 8.872e-06
mR <- update(mA, ~ . - no3)
summary(mR)
## 
## Call:
## lm(formula = sqrt(count) ~ acreage, data = longnosedace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8406 -2.0205 -0.2371  1.5548 10.3053 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.269e+00  4.951e-01   8.622 2.04e-12 ***
## acreage     1.505e-04  4.588e-05   3.281  0.00165 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.911 on 66 degrees of freedom
## Multiple R-squared:  0.1402, Adjusted R-squared:  0.1272 
## F-statistic: 10.77 on 1 and 66 DF,  p-value: 0.001654

You can find out more information by looking at

?formula

Activities

Practice these skills by taking a look at the data sets in the Sleuth3 package.

help(package="Sleuth3")

The data sets are organized by

  1. textbook usage ([case] study or [ex]ample)
  2. chapter, e.g. case0701 is chapter 7
  3. number, e.g. case0701 is number 1.

The Statistical Sleuth 3rd ed by Ramsey and Schafer begins discussing regression in Chapter 7 (although ANOVA is discussed in Chapter 5). Thus, data sets in Chapters 5 through 14 are relevant for exploration. When you get a new data set ask yourself these questions in this order

  1. What is the response variable?
  2. How many explanatory variables are there?
  3. Is the explanatory variable continuous or categorical?
  4. How many levels does each categorical explanatory variable have?