To follow along, use the lab08 code and make sure the following package is installed:

You can use the following code to perform the installation:

install.packages("Sleuth3")

Now load the packages

library("tidyverse")
library("Sleuth3")

set.seed(20210413) # for reproducibility

Overview

Categorical variables in R are represented either by a

A factor is a special character variable that

Characters

Characters is how R represents any non-numeric data.

is.character("a")
## [1] TRUE
is.character(1)
## [1] FALSE
is.character(pi)
## [1] FALSE
is.character("a123")
## [1] TRUE

R is case sensitive so

unique(c("a","a","A","A"))
## [1] "a" "A"

If you sort a character vector, you will get alphabetical ordering with the lowercase version of a character appearing before the uppercase version.

sort(sample(c(letters,LETTERS), 64, replace = TRUE))
##  [1] "a" "a" "a" "B" "c" "C" "d" "e" "E" "E" "f" "g" "g" "G" "h" "i" "i" "i" "i" "j" "J"
## [22] "J" "J" "l" "L" "L" "m" "M" "n" "N" "N" "o" "O" "O" "p" "p" "P" "r" "r" "R" "s" "S"
## [43] "S" "t" "T" "T" "u" "U" "v" "v" "w" "W" "W" "x" "x" "x" "X" "X" "y" "y" "y" "y" "Y"
## [64] "z"

In real data sets, explanatory variables will typically not be represented by single letters but instead will be strings (although these are still “characters” in terms of R’s data type).

Let’s take a look at the case0501 data set, but suppose that it is available from a file, just like it would be for your research.

write.csv(Sleuth3::case0501, file = "case0501.csv",
          row.names = FALSE)

Now we will read this file

d = read.csv("case0501.csv")
str(d)
## 'data.frame':    349 obs. of  2 variables:
##  $ Lifetime: num  35.5 35.4 34.9 34.8 33.8 33.5 32.6 32.4 31.8 31.6 ...
##  $ Diet    : chr  "NP" "NP" "NP" "NP" ...

Note that Lifetime is read in as a numeric variable while Diet is a character. Let’s take a look at Diet:

sort(unique(d$Diet))
## [1] "lopro" "N/N85" "N/R40" "N/R50" "NP"    "R/R50"

This tells us the levels of this explanatory variable and the ordering that R gives it. Importantly, the level lopro is the first alphabetically. This will come in later when we start running regressions.

Let’s understand a bit more about this variable

summary(d$Diet) # really uninformative
##    Length     Class      Mode 
##       349 character character
table(d$Diet)   # number of observations of each type
## 
## lopro N/N85 N/R40 N/R50    NP R/R50 
##    56    57    60    71    49    56

Unfortunately, the summary() function is really unhelpful. Typically table() is what you will want as this provides the number of observations of each type.

Character Activity

Use the ex0518 data set from the Sleuth3 package to practice writing to a csv file and reading from the csv file. For this activity, assume Protein is the response and Treatment is the only explanatory variable.

  1. Verify that Treatment is read in as a character
  2. Determine the unique values of Treatment
  3. Determine which value of Treatment comes first alphabetically
  4. Determine how many observations there are for each value of Treatment
Click for solution
write.csv(Sleuth3::ex0518, file = "ex0518.csv", row.names = FALSE)
d2 = read.csv("ex0518.csv")

is.character(d2$Treatment)
## [1] TRUE
unique(d2$Treament)
## NULL
sort(unique(d2$Treatment)) # first alphabetically
## [1] "Control" "CPFA150" "CPFA300" "CPFA450" "CPFA50"  "CPFA600"
table(d2$Treatment)
## 
## Control CPFA150 CPFA300 CPFA450  CPFA50 CPFA600 
##      15       3       3       3       3       3

Factors

A character variable can be converted to a factor.

d = read.csv(file = "case0501.csv") # Making sure we are all on the same page
d$Diet = factor(d$Diet)
is.character(d$Diet)
## [1] FALSE
is.factor(d$Diet)
## [1] TRUE

Diet is now a factor, but it almost looks like nothing has changed.

d$Diet
##   [1] NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP   
##  [14] NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP   
##  [27] NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    NP   
##  [40] NP    NP    NP    NP    NP    NP    NP    NP    NP    NP    N/N85 N/N85 N/N85
##  [53] N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85
##  [66] N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85
##  [79] N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85
##  [92] N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85 N/N85
## [105] N/N85 N/N85 lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro
## [118] lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro
## [131] lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro
## [144] lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro lopro
## [157] lopro lopro lopro lopro lopro lopro N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50
## [170] N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50
## [183] N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50
## [196] N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50
## [209] N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50
## [222] N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 N/R50 R/R50
## [235] R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50
## [248] R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50
## [261] R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50
## [274] R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50 R/R50
## [287] R/R50 R/R50 R/R50 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40
## [300] N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40
## [313] N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40
## [326] N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40
## [339] N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40 N/R40
## Levels: lopro N/N85 N/R40 N/R50 NP R/R50

Except there are no quotes and at the bottom there is the line that indicates the levels for this factor.

There is actually an internal representation via integers:

as.numeric(d$Diet)
##   [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
##  [42] 5 5 5 5 5 5 5 5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [83] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [124] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4
## [165] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [206] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 6 6 6 6 6 6
## [247] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [288] 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [329] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

This representation can be accessed using levels():

nlevels(d$Diet)
## [1] 6
levels(d$Diet)
## [1] "lopro" "N/N85" "N/R40" "N/R50" "NP"    "R/R50"

The default ordering is alphabetical.

When you run summary() or table(), R uses this ordering for displaying results:

summary(case0501$Diet)
## N/N85 N/R40 N/R50    NP R/R50 lopro 
##    57    60    71    49    56    56
table(  case0501$Diet)
## 
## N/N85 N/R40 N/R50    NP R/R50 lopro 
##    57    60    71    49    56    56

You can convert factor variables back into character variables

d$Diet_char <- as.character(d$Diet)
str(d)
## 'data.frame':    349 obs. of  3 variables:
##  $ Lifetime : num  35.5 35.4 34.9 34.8 33.8 33.5 32.6 32.4 31.8 31.6 ...
##  $ Diet     : Factor w/ 6 levels "lopro","N/N85",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Diet_char: chr  "NP" "NP" "NP" "NP" ...

and characters back to factors

d$Diet_fact <- factor(d$Diet_char)
str(d)
## 'data.frame':    349 obs. of  4 variables:
##  $ Lifetime : num  35.5 35.4 34.9 34.8 33.8 33.5 32.6 32.4 31.8 31.6 ...
##  $ Diet     : Factor w/ 6 levels "lopro","N/N85",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Diet_char: chr  "NP" "NP" "NP" "NP" ...
##  $ Diet_fact: Factor w/ 6 levels "lopro","N/N85",..: 5 5 5 5 5 5 5 5 5 5 ...
all.equal(d$Diet, d$Diet_fact)
## [1] TRUE

Modifying the order of factors

When creating factors you can choose the ordering using the levels argument of the factor() function.

d$Diet_fact <- factor(d$Diet_char, 
                      levels = c("N/N85", "lopro", "N/R40", 
                                 "N/R50", "NP",    "R/R50"))
str(d)
## 'data.frame':    349 obs. of  4 variables:
##  $ Lifetime : num  35.5 35.4 34.9 34.8 33.8 33.5 32.6 32.4 31.8 31.6 ...
##  $ Diet     : Factor w/ 6 levels "lopro","N/N85",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Diet_char: chr  "NP" "NP" "NP" "NP" ...
##  $ Diet_fact: Factor w/ 6 levels "N/N85","lopro",..: 5 5 5 5 5 5 5 5 5 5 ...
all.equal(d$Diet, d$Diet_fact) # no longer equal
## [1] "Attributes: < Component \"levels\": 2 string mismatches >"

Alternatively, you can move one of the levels to be first. This will be helpful when running regression analyses. We can use the relevel() function to choose a new “reference” level.

d$Diet_fact = relevel(d$Diet_fact, ref = "NP")
levels(d$Diet_fact)
## [1] "NP"    "N/N85" "lopro" "N/R40" "N/R50" "R/R50"

Notice that “NP” is now first.

Activity

Read in the ex0518 data set from the file you created in the previous activity. Convert Treatment into a factor variable.

  1. Verify that Treatment is now a factor
  2. Determine the unique levels of Treatment
  3. Determine which level of Treatment comes first
  4. Determine how many observations there are for each level of Treatment
  5. Change the ordering of the levels so that “CPFA50” is first.
Click for solution

Here is what I would do

d2 = read.csv("ex0518.csv") %>%
  mutate(Treatment = factor(Treatment))

is.factor(d2$Treatment)
levels(d2$Treatment)
levels(d2$Treatment)[1] # first alphabetically
summary(d2$Treatment)

d2$Treatment = relevel(d2$Treatment, ref = "CPFA50")
levels(d2$Treatment)

Using categorical variables in a regression

Let’s make sure we have the correct data.frame for the following.

d <- Sleuth3::case0501 %>%
  mutate(Diet_char = as.character(Diet),
         Diet_fact = as.factor(Diet_char))

Constructing dummy variable by hand

Recall that the approach to including categorical explanatory variables in a regression involves

  1. Choosing a reference level
  2. Creating dummy variables for all other levels
  3. Running a multiple regression using these dummy variables

Let’s suppose choose “N/N85” as our reference level because this is the standard Diet that mice get. Now we need to create dummy variables for all other levels.

d = d %>%
  mutate(X1 = Diet == "N/R40",
         X2 = Diet == "N/R50",
         X3 = Diet == "NP",
         X4 = Diet == "R/R50",
         X5 = Diet == "lopro")

m = lm(Lifetime ~ X1 + X2 + X3 + X4 + X5, data = d)
summary(m)
## 
## Call:
## lm(formula = Lifetime ~ X1 + X2 + X3 + X4 + X5, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
## X1TRUE       12.4254     1.2352  10.059  < 2e-16 ***
## X2TRUE        9.6060     1.1877   8.088 1.06e-14 ***
## X3TRUE       -5.2892     1.3010  -4.065 5.95e-05 ***
## X4TRUE       10.1945     1.2565   8.113 8.88e-15 ***
## X5TRUE        6.9945     1.2565   5.567 5.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

By construction, 32.7 is the expected lifetime on the N/N85 Diet and 12.4 is the expected difference in lifetimes between the N/N85 and lopro Diets (since it is positive expected lifetime on lopro is higher).

This was pretty tedious. Fortunately, R can do all the work for us.

Using a character vector as the explanatory variable

Let’s perform the Lifetime-Diet regression using the character vector for Diet.

m <- lm(Lifetime ~ Diet_char, data = d)
summary(m)
## 
## Call:
## lm(formula = Lifetime ~ Diet_char, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     39.6857     0.8924  44.470  < 2e-16 ***
## Diet_charN/N85  -6.9945     1.2565  -5.567 5.25e-08 ***
## Diet_charN/R40   5.4310     1.2409   4.377 1.60e-05 ***
## Diet_charN/R50   2.6115     1.1936   2.188   0.0293 *  
## Diet_charNP    -12.2837     1.3064  -9.403  < 2e-16 ***
## Diet_charR/R50   3.2000     1.2621   2.536   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

Notice that it automatically constructed all our indicator variables. It automatically chose the lopro diet as the reference level (because it came first alphabetically).

If you leave Diet as a character vector then you cannot control the reference level.

Using a factor vector as the explanatory variable

Let’s perform the Lifetime-Diet regression using the factor vector for Diet.

m <- lm(Lifetime ~ Diet_fact, data = d)
summary(m)
## 
## Call:
## lm(formula = Lifetime ~ Diet_fact, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     39.6857     0.8924  44.470  < 2e-16 ***
## Diet_factN/N85  -6.9945     1.2565  -5.567 5.25e-08 ***
## Diet_factN/R40   5.4310     1.2409   4.377 1.60e-05 ***
## Diet_factN/R50   2.6115     1.1936   2.188   0.0293 *  
## Diet_factNP    -12.2837     1.3064  -9.403  < 2e-16 ***
## Diet_factR/R50   3.2000     1.2621   2.536   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

These results are identical to the results using the character vector.

Adjusting the reference level

Now suppose you want to run the regression using N/N85 (normal diet) as the reference level. Do this using the relevel() function and its ref argument.

d$Diet_fact <- relevel(d$Diet_fact, ref = "N/N85")
table(d$Diet_fact)
## 
## N/N85 lopro N/R40 N/R50    NP R/R50 
##    57    56    60    71    49    56

Notice how N/N85 is now the first level.

Now when you run the regression the reference level will be N/N85.

m <- lm(Lifetime ~ Diet_fact, data = d)
summary(m)
## 
## Call:
## lm(formula = Lifetime ~ Diet_fact, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32.6912     0.8846  36.958  < 2e-16 ***
## Diet_factlopro   6.9945     1.2565   5.567 5.25e-08 ***
## Diet_factN/R40  12.4254     1.2352  10.059  < 2e-16 ***
## Diet_factN/R50   9.6060     1.1877   8.088 1.06e-14 ***
## Diet_factNP     -5.2892     1.3010  -4.065 5.95e-05 ***
## Diet_factR/R50  10.1945     1.2565   8.113 8.88e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

The authors of the Sleuth3 package had already made N/N85 the reference level so if you run the regression with the original data, N/N85 will be the reference level.

m <- lm(Lifetime ~ Diet, data = Sleuth3::case0501)
summary(m)
## 
## Call:
## lm(formula = Lifetime ~ Diet, data = Sleuth3::case0501)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5167  -3.3857   0.8143   5.1833  10.0143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
## DietN/R40    12.4254     1.2352  10.059  < 2e-16 ***
## DietN/R50     9.6060     1.1877   8.088 1.06e-14 ***
## DietNP       -5.2892     1.3010  -4.065 5.95e-05 ***
## DietR/R50    10.1945     1.2565   8.113 8.88e-15 ***
## Dietlopro     6.9945     1.2565   5.567 5.25e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 343 degrees of freedom
## Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463 
## F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

Data analysis activity

Use the ex0518 data.frame from the Sleuth3 package.

Adjust the order of the levels of treatment to have increasing CPFA and make “Control” the reference level. Create a plot of protein versus treatment. Run a regression of protein on treatment to determine which CPFA levels have a mean protein level that is significantly different from the control. Look at the diagnostic plots to determine if there are any clear violations of model assumptions.

Click for solution

Get the data and adjust the levels of treatment.

d <- Sleuth3::ex0518 %>%
  mutate(Treatment = factor(Treatment, 
                            levels = c("Control","CPFA50","CPFA150",
                                       "CPFA300","CPFA450","CPFA600")))

Now create the plot.

library(ggplot2)
ggplot(d, aes(Treatment, Protein)) +
  geom_jitter(height=0, width=.25) +
  theme_bw()

Notice that the ordering on the x-axis is according to the ordering of the levels of treatment.

Now run the regression.

m <- lm(Protein ~ Treatment, data = d)
summary(m)
## 
## Call:
## lm(formula = Protein ~ Treatment, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.600  -9.417   3.400   9.300  30.400 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       185.600      4.441  41.792  < 2e-16 ***
## TreatmentCPFA50   -17.267     10.878  -1.587  0.12554    
## TreatmentCPFA150  -13.933     10.878  -1.281  0.21249    
## TreatmentCPFA300  -38.933     10.878  -3.579  0.00151 ** 
## TreatmentCPFA450  -34.600     10.878  -3.181  0.00402 ** 
## TreatmentCPFA600  -33.267     10.878  -3.058  0.00540 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.2 on 24 degrees of freedom
## Multiple R-squared:  0.5043, Adjusted R-squared:  0.401 
## F-statistic: 4.883 on 5 and 24 DF,  p-value: 0.003196

Based on this regression analysis, CPFA levels at 300 or above have significantly different mean protein levels.

Finally look at the diagnostic plots.

opar = par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

There is no clear violation here.

Regression using continuous explanatory variable activity

Use the previous data to construct a continuous explanatory variable that has the CPFA level. Then run a regression using this explanatory variable and assess model assumptions using a diagnostic plot.

Click for solution

I’m not necessarily expecting you to come up with this approach that uses gsub() and as.numeric().

d <- d %>%
  mutate(cpfa = as.numeric(gsub("CPFA","", Treatment)),
         cpfa = ifelse(Treatment == "Control", 0, cpfa))
## Warning: Problem with `mutate()` input `cpfa`.
## i NAs introduced by coercion
## i Input `cpfa` is `as.numeric(gsub("CPFA", "", Treatment))`.
summary(d)
##     Protein        Treatment    Day     TrtDayGroup      cpfa    
##  Min.   :124.0   Control:15   Day1:6   Group1 : 3   Min.   :  0  
##  1st Qu.:157.0   CPFA50 : 3   Day2:6   Group10: 3   1st Qu.:  0  
##  Median :164.5   CPFA150: 3   Day3:6   Group2 : 3   Median : 25  
##  Mean   :171.8   CPFA300: 3   Day4:6   Group3 : 3   Mean   :155  
##  3rd Qu.:190.8   CPFA450: 3   Day5:6   Group4 : 3   3rd Qu.:300  
##  Max.   :216.0   CPFA600: 3            Group5 : 3   Max.   :600  
##                                        (Other):12

Run the regression

m <- lm(Protein ~ cpfa, data = d)

opar = par(mfrow=c(2,3))
plot(m, 1:6, ask = FALSE)

par(opar)

plot(d$cpfa, m$residuals)

If you want to clean up your directory, run the following lines

if (file.exists("case0501.csv")) file.remove("case0501.csv")
## [1] TRUE
if (file.exists("ex0518.csv"))   file.remove("ex0518.csv")
## [1] TRUE