R code

We will start with loading up the necessary packages using the library() command.

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data transformation

The dplyr package provides a number of convenience functions for dealing with data.frame objects in R. As a reminder, a data.frame is a table-like object that, unlike a matrix, allows different columns to have different data types.

Although we will focus on the dplyr package, you may want to look up the data.table package. Generally, dplyr aims for readable code while data.table aims for efficient code.

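Technically, dplyr functions return an object of the same type as their input: a data.frame in gives a data.frame out, while a tibble (a specialized data.frame) in gives a tibble out. Grouped operations such as group_by() also return tibbles.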

tibble

?tibble
ToothGrowth
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
## 16 17.3   VC  1.0
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 19 18.8   VC  1.0
## 20 15.5   VC  1.0
## 21 23.6   VC  2.0
## 22 18.5   VC  2.0
## 23 33.9   VC  2.0
## 24 25.5   VC  2.0
## 25 26.4   VC  2.0
## 26 32.5   VC  2.0
## 27 26.7   VC  2.0
## 28 21.5   VC  2.0
## 29 23.3   VC  2.0
## 30 29.5   VC  2.0
## 31 15.2   OJ  0.5
## 32 21.5   OJ  0.5
## 33 17.6   OJ  0.5
## 34  9.7   OJ  0.5
## 35 14.5   OJ  0.5
## 36 10.0   OJ  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 39 16.5   OJ  0.5
## 40  9.7   OJ  0.5
## 41 19.7   OJ  1.0
## 42 23.3   OJ  1.0
## 43 23.6   OJ  1.0
## 44 26.4   OJ  1.0
## 45 20.0   OJ  1.0
## 46 25.2   OJ  1.0
## 47 25.8   OJ  1.0
## 48 21.2   OJ  1.0
## 49 14.5   OJ  1.0
## 50 27.3   OJ  1.0
## 51 25.5   OJ  2.0
## 52 26.4   OJ  2.0
## 53 22.4   OJ  2.0
## 54 24.5   OJ  2.0
## 55 24.8   OJ  2.0
## 56 30.9   OJ  2.0
## 57 26.4   OJ  2.0
## 58 27.3   OJ  2.0
## 59 29.4   OJ  2.0
## 60 23.0   OJ  2.0

vs

as_tibble(ToothGrowth)
## # A tibble: 60 × 3
##      len supp   dose
##    <dbl> <fct> <dbl>
##  1   4.2 VC      0.5
##  2  11.5 VC      0.5
##  3   7.3 VC      0.5
##  4   5.8 VC      0.5
##  5   6.4 VC      0.5
##  6  10   VC      0.5
##  7  11.2 VC      0.5
##  8  11.2 VC      0.5
##  9   5.2 VC      0.5
## 10   7   VC      0.5
## # … with 50 more rows
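
One way to see how the two classes relate is to inspect them directly. The following is a small sketch, assuming the tibble package is installed; the object names tg and df are only for illustration.

```r
library(tibble)

tg <- as_tibble(ToothGrowth)   # tg is an illustrative name

# A tibble still inherits from data.frame, so data.frame code keeps working
class(ToothGrowth)
class(tg)
is.data.frame(tg)

# Convert back to a plain data.frame when needed
df <- as.data.frame(tg)
class(df)
```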

Filtering observations

Filtering means keeping only the observations that satisfy some criterion and removing the rest.

base R

ToothGrowth[ToothGrowth$supp == "VC", ]
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
## 16 17.3   VC  1.0
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 19 18.8   VC  1.0
## 20 15.5   VC  1.0
## 21 23.6   VC  2.0
## 22 18.5   VC  2.0
## 23 33.9   VC  2.0
## 24 25.5   VC  2.0
## 25 26.4   VC  2.0
## 26 32.5   VC  2.0
## 27 26.7   VC  2.0
## 28 21.5   VC  2.0
## 29 23.3   VC  2.0
## 30 29.5   VC  2.0
subset(ToothGrowth, supp == "VC")
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
## 16 17.3   VC  1.0
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 19 18.8   VC  1.0
## 20 15.5   VC  1.0
## 21 23.6   VC  2.0
## 22 18.5   VC  2.0
## 23 33.9   VC  2.0
## 24 25.5   VC  2.0
## 25 26.4   VC  2.0
## 26 32.5   VC  2.0
## 27 26.7   VC  2.0
## 28 21.5   VC  2.0
## 29 23.3   VC  2.0
## 30 29.5   VC  2.0

or

ToothGrowth[ToothGrowth$len < 10, ]
##    len supp dose
## 1  4.2   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 9  5.2   VC  0.5
## 10 7.0   VC  0.5
## 34 9.7   OJ  0.5
## 37 8.2   OJ  0.5
## 38 9.4   OJ  0.5
## 40 9.7   OJ  0.5
subset(ToothGrowth, len < 10)
##    len supp dose
## 1  4.2   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 9  5.2   VC  0.5
## 10 7.0   VC  0.5
## 34 9.7   OJ  0.5
## 37 8.2   OJ  0.5
## 38 9.4   OJ  0.5
## 40 9.7   OJ  0.5
subset(ToothGrowth, len < 10 & supp == "VC") # using the AND operator &
##    len supp dose
## 1  4.2   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 9  5.2   VC  0.5
## 10 7.0   VC  0.5

dplyr::filter

Note that dplyr::filter masks stats::filter(). We will be explicit here and call the filter function from the dplyr package using the double colon operator.

?`::`
dplyr::filter(ToothGrowth, supp == "VC")
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
## 16 17.3   VC  1.0
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 19 18.8   VC  1.0
## 20 15.5   VC  1.0
## 21 23.6   VC  2.0
## 22 18.5   VC  2.0
## 23 33.9   VC  2.0
## 24 25.5   VC  2.0
## 25 26.4   VC  2.0
## 26 32.5   VC  2.0
## 27 26.7   VC  2.0
## 28 21.5   VC  2.0
## 29 23.3   VC  2.0
## 30 29.5   VC  2.0
dplyr::filter(ToothGrowth, len < 10)
##    len supp dose
## 1  4.2   VC  0.5
## 2  7.3   VC  0.5
## 3  5.8   VC  0.5
## 4  6.4   VC  0.5
## 5  5.2   VC  0.5
## 6  7.0   VC  0.5
## 7  9.7   OJ  0.5
## 8  8.2   OJ  0.5
## 9  9.4   OJ  0.5
## 10 9.7   OJ  0.5
dplyr::filter(ToothGrowth, len < 10, supp == "VC")
##   len supp dose
## 1 4.2   VC  0.5
## 2 7.3   VC  0.5
## 3 5.8   VC  0.5
## 4 6.4   VC  0.5
## 5 5.2   VC  0.5
## 6 7.0   VC  0.5
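
The comma-separated conditions above are combined with AND. The sketch below, which assumes dplyr is installed, compares that form with an explicit & and shows two other common patterns: | for OR and %in% for matching several values. The object names are illustrative.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
a <- dplyr::filter(ToothGrowth, len < 10, supp == "VC")
b <- dplyr::filter(ToothGrowth, len < 10 & supp == "VC")
all.equal(a, b)

# Use | for OR: rows that are very short or very long
extremes <- dplyr::filter(ToothGrowth, len < 6 | len > 30)
extremes

# Use %in% to match any of several values
low_mid <- dplyr::filter(ToothGrowth, dose %in% c(0.5, 1))
nrow(low_mid)
```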

arrange

arrange() sorts the rows of a data.frame by the values of one or more variables.

base R

ToothGrowth[order(ToothGrowth$len), ]
##     len supp dose
## 1   4.2   VC  0.5
## 9   5.2   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 10  7.0   VC  0.5
## 3   7.3   VC  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 34  9.7   OJ  0.5
## 40  9.7   OJ  0.5
## 6  10.0   VC  0.5
## 36 10.0   OJ  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 2  11.5   VC  0.5
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 35 14.5   OJ  0.5
## 49 14.5   OJ  1.0
## 13 15.2   VC  1.0
## 31 15.2   OJ  0.5
## 20 15.5   VC  1.0
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 39 16.5   OJ  0.5
## 14 17.3   VC  1.0
## 16 17.3   VC  1.0
## 33 17.6   OJ  0.5
## 22 18.5   VC  2.0
## 19 18.8   VC  1.0
## 41 19.7   OJ  1.0
## 45 20.0   OJ  1.0
## 48 21.2   OJ  1.0
## 28 21.5   VC  2.0
## 32 21.5   OJ  0.5
## 53 22.4   OJ  2.0
## 15 22.5   VC  1.0
## 60 23.0   OJ  2.0
## 29 23.3   VC  2.0
## 42 23.3   OJ  1.0
## 21 23.6   VC  2.0
## 43 23.6   OJ  1.0
## 54 24.5   OJ  2.0
## 55 24.8   OJ  2.0
## 46 25.2   OJ  1.0
## 24 25.5   VC  2.0
## 51 25.5   OJ  2.0
## 47 25.8   OJ  1.0
## 25 26.4   VC  2.0
## 44 26.4   OJ  1.0
## 52 26.4   OJ  2.0
## 57 26.4   OJ  2.0
## 27 26.7   VC  2.0
## 50 27.3   OJ  1.0
## 58 27.3   OJ  2.0
## 59 29.4   OJ  2.0
## 30 29.5   VC  2.0
## 56 30.9   OJ  2.0
## 26 32.5   VC  2.0
## 23 33.9   VC  2.0

What happens if we want to arrange by supp and then by dose? In base R, order() accepts several sort keys, but the call quickly becomes verbose.
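
For reference, here is a base R sketch of sorting by supp and then by dose. It relies only on order() accepting several keys; the object names are illustrative.

```r
# order() accepts several keys; ties in the first are broken by the second
sorted <- ToothGrowth[order(ToothGrowth$supp, ToothGrowth$dose), ]
head(sorted)

# with() avoids repeating the data.frame name for every key
sorted2 <- with(ToothGrowth, ToothGrowth[order(supp, dose), ])
identical(sorted, sorted2)

# Descending order on a numeric key: negate it
by_len_desc <- ToothGrowth[order(-ToothGrowth$len), ]
head(by_len_desc)
```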

dplyr::arrange

arrange(ToothGrowth, len)
##     len supp dose
## 1   4.2   VC  0.5
## 2   5.2   VC  0.5
## 3   5.8   VC  0.5
## 4   6.4   VC  0.5
## 5   7.0   VC  0.5
## 6   7.3   VC  0.5
## 7   8.2   OJ  0.5
## 8   9.4   OJ  0.5
## 9   9.7   OJ  0.5
## 10  9.7   OJ  0.5
## 11 10.0   VC  0.5
## 12 10.0   OJ  0.5
## 13 11.2   VC  0.5
## 14 11.2   VC  0.5
## 15 11.5   VC  0.5
## 16 13.6   VC  1.0
## 17 14.5   VC  1.0
## 18 14.5   OJ  0.5
## 19 14.5   OJ  1.0
## 20 15.2   VC  1.0
## 21 15.2   OJ  0.5
## 22 15.5   VC  1.0
## 23 16.5   VC  1.0
## 24 16.5   VC  1.0
## 25 16.5   OJ  0.5
## 26 17.3   VC  1.0
## 27 17.3   VC  1.0
## 28 17.6   OJ  0.5
## 29 18.5   VC  2.0
## 30 18.8   VC  1.0
## 31 19.7   OJ  1.0
## 32 20.0   OJ  1.0
## 33 21.2   OJ  1.0
## 34 21.5   VC  2.0
## 35 21.5   OJ  0.5
## 36 22.4   OJ  2.0
## 37 22.5   VC  1.0
## 38 23.0   OJ  2.0
## 39 23.3   VC  2.0
## 40 23.3   OJ  1.0
## 41 23.6   VC  2.0
## 42 23.6   OJ  1.0
## 43 24.5   OJ  2.0
## 44 24.8   OJ  2.0
## 45 25.2   OJ  1.0
## 46 25.5   VC  2.0
## 47 25.5   OJ  2.0
## 48 25.8   OJ  1.0
## 49 26.4   VC  2.0
## 50 26.4   OJ  1.0
## 51 26.4   OJ  2.0
## 52 26.4   OJ  2.0
## 53 26.7   VC  2.0
## 54 27.3   OJ  1.0
## 55 27.3   OJ  2.0
## 56 29.4   OJ  2.0
## 57 29.5   VC  2.0
## 58 30.9   OJ  2.0
## 59 32.5   VC  2.0
## 60 33.9   VC  2.0
arrange(ToothGrowth, supp, dose)
##     len supp dose
## 1  15.2   OJ  0.5
## 2  21.5   OJ  0.5
## 3  17.6   OJ  0.5
## 4   9.7   OJ  0.5
## 5  14.5   OJ  0.5
## 6  10.0   OJ  0.5
## 7   8.2   OJ  0.5
## 8   9.4   OJ  0.5
## 9  16.5   OJ  0.5
## 10  9.7   OJ  0.5
## 11 19.7   OJ  1.0
## 12 23.3   OJ  1.0
## 13 23.6   OJ  1.0
## 14 26.4   OJ  1.0
## 15 20.0   OJ  1.0
## 16 25.2   OJ  1.0
## 17 25.8   OJ  1.0
## 18 21.2   OJ  1.0
## 19 14.5   OJ  1.0
## 20 27.3   OJ  1.0
## 21 25.5   OJ  2.0
## 22 26.4   OJ  2.0
## 23 22.4   OJ  2.0
## 24 24.5   OJ  2.0
## 25 24.8   OJ  2.0
## 26 30.9   OJ  2.0
## 27 26.4   OJ  2.0
## 28 27.3   OJ  2.0
## 29 29.4   OJ  2.0
## 30 23.0   OJ  2.0
## 31  4.2   VC  0.5
## 32 11.5   VC  0.5
## 33  7.3   VC  0.5
## 34  5.8   VC  0.5
## 35  6.4   VC  0.5
## 36 10.0   VC  0.5
## 37 11.2   VC  0.5
## 38 11.2   VC  0.5
## 39  5.2   VC  0.5
## 40  7.0   VC  0.5
## 41 16.5   VC  1.0
## 42 16.5   VC  1.0
## 43 15.2   VC  1.0
## 44 17.3   VC  1.0
## 45 22.5   VC  1.0
## 46 17.3   VC  1.0
## 47 13.6   VC  1.0
## 48 14.5   VC  1.0
## 49 18.8   VC  1.0
## 50 15.5   VC  1.0
## 51 23.6   VC  2.0
## 52 18.5   VC  2.0
## 53 33.9   VC  2.0
## 54 25.5   VC  2.0
## 55 26.4   VC  2.0
## 56 32.5   VC  2.0
## 57 26.7   VC  2.0
## 58 21.5   VC  2.0
## 59 23.3   VC  2.0
## 60 29.5   VC  2.0

Selecting variables

Here we will select which columns to keep.

base::subset

ToothGrowth[, c("len","dose")]
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
ToothGrowth[, c(1,3)]
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
ToothGrowth[, -2]
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
subset(ToothGrowth, select = c(len, dose))
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0

dplyr::select

dplyr::select(ToothGrowth, len, dose)
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
dplyr::select(ToothGrowth, c(1,3))
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
dplyr::select(ToothGrowth, -2)
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
dplyr::select(ToothGrowth, -supp)
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0

tidyselect::starts_with()

When data.frames get large, it is tedious to type every variable you want to keep. Often we want to keep a collection of variables whose names share certain properties.

select(ToothGrowth, starts_with("len"))
##     len
## 1   4.2
## 2  11.5
## 3   7.3
## 4   5.8
## 5   6.4
## 6  10.0
## 7  11.2
## 8  11.2
## 9   5.2
## 10  7.0
## 11 16.5
## 12 16.5
## 13 15.2
## 14 17.3
## 15 22.5
## 16 17.3
## 17 13.6
## 18 14.5
## 19 18.8
## 20 15.5
## 21 23.6
## 22 18.5
## 23 33.9
## 24 25.5
## 25 26.4
## 26 32.5
## 27 26.7
## 28 21.5
## 29 23.3
## 30 29.5
## 31 15.2
## 32 21.5
## 33 17.6
## 34  9.7
## 35 14.5
## 36 10.0
## 37  8.2
## 38  9.4
## 39 16.5
## 40  9.7
## 41 19.7
## 42 23.3
## 43 23.6
## 44 26.4
## 45 20.0
## 46 25.2
## 47 25.8
## 48 21.2
## 49 14.5
## 50 27.3
## 51 25.5
## 52 26.4
## 53 22.4
## 54 24.5
## 55 24.8
## 56 30.9
## 57 26.4
## 58 27.3
## 59 29.4
## 60 23.0

There are a variety of these helper functions; see

?starts_with
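
A few of the other helpers, sketched on ToothGrowth. With only three columns the selections are trivial, but the same calls apply to wide data.frames. This assumes dplyr 1.0.0 or later for where(); the object names are illustrative.

```r
library(dplyr)

# Columns whose names end with a given suffix
by_suffix <- select(ToothGrowth, ends_with("e"))
names(by_suffix)      # "dose"

# Columns whose names contain a substring
by_substring <- select(ToothGrowth, contains("s"))
names(by_substring)   # "supp" "dose"

# Columns chosen by type rather than by name
by_type <- select(ToothGrowth, where(is.numeric))
names(by_type)        # "len" "dose"
```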

Renaming variables

Variable names often need to be modified.

base::names()

d <- ToothGrowth
names(d)
## [1] "len"  "supp" "dose"
names(d) <- c("Length","Delivery","Dose")
head(d)
##   Length Delivery Dose
## 1    4.2       VC  0.5
## 2   11.5       VC  0.5
## 3    7.3       VC  0.5
## 4    5.8       VC  0.5
## 5    6.4       VC  0.5
## 6   10.0       VC  0.5

For the rules on valid variable names, see

?make.names
names(d)[2] <- "Delivery Method"
names(d)
## [1] "Length"          "Delivery Method" "Dose"

This can get you into trouble

d$Delivery Method
## Error: <text>:1:12: unexpected symbol
## 1: d$Delivery Method
##                ^

but you can use “bad” variable names by wrapping them in backticks

d$`Delivery Method`
##  [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
## [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC
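
If you would rather avoid backticks, make.names() shows what a syntactically valid version of a name would look like. A short sketch using base R only; the example strings are illustrative.

```r
# make.names() converts arbitrary strings into syntactically valid names
make.names("Delivery Method")   # "Delivery.Method"
make.names("2nd dose")          # "X2nd.dose"

# Non-syntactic names can also be reached with [[ ]] instead of backticks
d <- ToothGrowth
names(d)[2] <- "Delivery Method"
identical(d$`Delivery Method`, d[["Delivery Method"]])
```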

dplyr::rename()

d <- ToothGrowth
names(d)
## [1] "len"  "supp" "dose"
d <- rename(d,
  Length = len,
  `Delivery Method` = supp,
  Dose = dose
)
names(d)
## [1] "Length"          "Delivery Method" "Dose"
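
Note the argument order in rename(): the new name goes on the left and the existing name on the right. The sketch below repeats that pattern for a single column and also shows rename_with(), which applies a function to the names instead (assuming dplyr 1.0.0 or later). The names d1 and d2 are illustrative.

```r
library(dplyr)

# rename() takes new_name = old_name pairs
d1 <- rename(ToothGrowth, Length = len)
names(d1)   # "Length" "supp" "dose"

# rename_with() applies a function to the column names
d2 <- rename_with(ToothGrowth, toupper)
names(d2)   # "LEN" "SUPP" "DOSE"
```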

Adding and modifying variables

We will often want to create new variables or modify existing variables.

base::$

We have already seen the extract operator to extract a column out of a data.frame. Here we will use it to create a new variable.

d <- ToothGrowth
d$`Length (mm)` <- d$len / 1000 # original length was in microns
head(d)

dplyr::mutate()

mutate(ToothGrowth, `Length (mm)` = len / 1000) # original length was in microns
##     len supp dose Length (mm)
## 1   4.2   VC  0.5      0.0042
## 2  11.5   VC  0.5      0.0115
## 3   7.3   VC  0.5      0.0073
## 4   5.8   VC  0.5      0.0058
## 5   6.4   VC  0.5      0.0064
## 6  10.0   VC  0.5      0.0100
## 7  11.2   VC  0.5      0.0112
## 8  11.2   VC  0.5      0.0112
## 9   5.2   VC  0.5      0.0052
## 10  7.0   VC  0.5      0.0070
## 11 16.5   VC  1.0      0.0165
## 12 16.5   VC  1.0      0.0165
## 13 15.2   VC  1.0      0.0152
## 14 17.3   VC  1.0      0.0173
## 15 22.5   VC  1.0      0.0225
## 16 17.3   VC  1.0      0.0173
## 17 13.6   VC  1.0      0.0136
## 18 14.5   VC  1.0      0.0145
## 19 18.8   VC  1.0      0.0188
## 20 15.5   VC  1.0      0.0155
## 21 23.6   VC  2.0      0.0236
## 22 18.5   VC  2.0      0.0185
## 23 33.9   VC  2.0      0.0339
## 24 25.5   VC  2.0      0.0255
## 25 26.4   VC  2.0      0.0264
## 26 32.5   VC  2.0      0.0325
## 27 26.7   VC  2.0      0.0267
## 28 21.5   VC  2.0      0.0215
## 29 23.3   VC  2.0      0.0233
## 30 29.5   VC  2.0      0.0295
## 31 15.2   OJ  0.5      0.0152
## 32 21.5   OJ  0.5      0.0215
## 33 17.6   OJ  0.5      0.0176
## 34  9.7   OJ  0.5      0.0097
## 35 14.5   OJ  0.5      0.0145
## 36 10.0   OJ  0.5      0.0100
## 37  8.2   OJ  0.5      0.0082
## 38  9.4   OJ  0.5      0.0094
## 39 16.5   OJ  0.5      0.0165
## 40  9.7   OJ  0.5      0.0097
## 41 19.7   OJ  1.0      0.0197
## 42 23.3   OJ  1.0      0.0233
## 43 23.6   OJ  1.0      0.0236
## 44 26.4   OJ  1.0      0.0264
## 45 20.0   OJ  1.0      0.0200
## 46 25.2   OJ  1.0      0.0252
## 47 25.8   OJ  1.0      0.0258
## 48 21.2   OJ  1.0      0.0212
## 49 14.5   OJ  1.0      0.0145
## 50 27.3   OJ  1.0      0.0273
## 51 25.5   OJ  2.0      0.0255
## 52 26.4   OJ  2.0      0.0264
## 53 22.4   OJ  2.0      0.0224
## 54 24.5   OJ  2.0      0.0245
## 55 24.8   OJ  2.0      0.0248
## 56 30.9   OJ  2.0      0.0309
## 57 26.4   OJ  2.0      0.0264
## 58 27.3   OJ  2.0      0.0273
## 59 29.4   OJ  2.0      0.0294
## 60 23.0   OJ  2.0      0.0230
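
mutate() can create several variables in one call, and later expressions may use variables created earlier in the same call. A brief sketch, assuming dplyr is installed; the column names len_mm, len_cm, and high_dose are made up for illustration.

```r
library(dplyr)

d <- mutate(
  ToothGrowth,
  len_mm    = len / 1000,   # microns to millimetres
  len_cm    = len_mm / 10,  # uses the column created on the previous line
  high_dose = dose >= 2     # logical indicator column
)
head(d)
```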

Data summaries

Often we will want to create summaries of our data.

base::mean

mean(ToothGrowth$len)
## [1] 18.81333
table(ToothGrowth$supp)
## 
## OJ VC 
## 30 30

dplyr::summarize()

dplyr::summarize(ToothGrowth, mean_len = mean(len))
##   mean_len
## 1 18.81333
dplyr::summarize(
  ToothGrowth,
  n_VC = sum(supp == "VC"),
  n_OJ = sum(supp == "OJ")
)
##   n_VC n_OJ
## 1   30   30
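
For simple counts, dplyr::count() may be more convenient than writing one sum() per level, because it returns one row per distinct value. A sketch, assuming dplyr is installed; the object names are illustrative.

```r
library(dplyr)

# One row per level of supp, with the count in column n
by_supp <- dplyr::count(ToothGrowth, supp)
by_supp

# Counting combinations of several variables works the same way
by_supp_dose <- dplyr::count(ToothGrowth, supp, dose)
by_supp_dose
```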

Data pipelines

According to IBM,

A data pipeline is a method in which raw data is ingested from various data sources and then ported to data store, like a data lake or data warehouse, for analysis. Before data flows into a data repository, it usually undergoes some data processing. This is inclusive of data transformations, such as filtering, masking, and aggregations, which ensure appropriate data integration and standardization.

%>%

The pipe operator is a simple function that takes everything on the left-hand side (lhs) of the pipe and “pipes” it as the first argument to the function on the right-hand side.

16 %>% sqrt()
## [1] 4

This becomes especially useful when used in conjunction with additional pipe operators.

256 %>% sqrt() %>% sqrt()
## [1] 4
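
To make the rewriting rule concrete, the sketch below compares piped calls with their nested equivalents. It assumes the magrittr package is installed (it is also attached with the tidyverse); the numbers are arbitrary.

```r
library(magrittr)

# x %>% f(y) is evaluated as f(x, y)
piped  <- c(2.345, 6.789) %>% round(1)
nested <- round(c(2.345, 6.789), 1)
identical(piped, nested)

# A chain reads left to right; the nested form reads inside out
chained <- 256 %>% sqrt() %>% sqrt()
inside_out <- sqrt(sqrt(256))
identical(chained, inside_out)
```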

The original pipe operator has moved around a bit, but (I believe) it is now defined in the magrittr package. As of version 4.1.0, base R also has its own native pipe operator, |>.

16 |> sqrt()
## [1] 4

Here is a brief history of the pipe operator in R.

Combining operations

ToothGrowth %>%
  filter(dose == 0.5, supp == "VC") %>%
  summarize(
    n = n(), 
    mean = mean(len),
    sd = sd(len))
##    n mean       sd
## 1 10 7.98 2.746634

split-apply-combine

Often, we want to split our data up, apply some operations to each piece, and then combine the results back together. In the ToothGrowth data.frame, we may want to calculate some summary statistics for each combination of delivery method and dose.

dplyr::group_by()

ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(
    n = n(),
    mean = mean(len),
    sd = sd(len),
    .groups = "drop"
  ) 
## # A tibble: 6 × 5
##   supp   dose     n  mean    sd
##   <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ      0.5    10 13.2   4.46
## 2 OJ      1      10 22.7   3.91
## 3 OJ      2      10 26.1   2.66
## 4 VC      0.5    10  7.98  2.75
## 5 VC      1      10 16.8   2.52
## 6 VC      2      10 26.1   4.80
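
To show what group_by() and summarize() are doing conceptually, here is a base R sketch that spells out the split, apply, and combine steps explicitly. It is only an illustration of the idea, not how dplyr is implemented; the object names are made up.

```r
# Split: one data.frame per supp/dose combination
pieces <- split(ToothGrowth, list(ToothGrowth$supp, ToothGrowth$dose))

# Apply: summarize each piece separately
summaries <- lapply(pieces, function(p) {
  data.frame(
    supp = p$supp[1],
    dose = p$dose[1],
    n    = nrow(p),
    mean = mean(p$len),
    sd   = sd(p$len)
  )
})

# Combine: stack the per-group summaries into one data.frame
result <- do.call(rbind, summaries)
result
```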

Advanced plotting

First, construct the necessary data.frame

s <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(
    n = n(),
    mean = mean(len),
    sd = sd(len),
    .groups = "drop"
  ) %>%
  mutate(
    ucl = mean + qt(.975, n-1)*sd/sqrt(n),
    lcl = mean - qt(.975, n-1)*sd/sqrt(n)
  )
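
The ucl and lcl columns are the endpoints of a t-based 95% confidence interval for each group mean: mean ± t(0.975, n-1) × sd/√n. As a sanity check, the sketch below computes the interval by hand for one group and compares it with the interval reported by t.test(). The object names are illustrative.

```r
# One group: ascorbic acid (VC) at dose 0.5
x <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]

n  <- length(x)
se <- sd(x) / sqrt(n)   # standard error of the mean
manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test() reports a 95% interval by default
builtin <- as.numeric(t.test(x)$conf.int)

all.equal(manual, builtin)
```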

Then construct the plot

dw <- 0.1
ggplot(s, 
       aes(x = dose, color = supp, shape = supp)) +
  
  geom_point(
    data = ToothGrowth,
    aes(y = len),
    position = position_jitterdodge(dodge.width = dw, jitter.width = 0.05)) +
  
  geom_pointrange(
    data = s, 
    aes(y = mean, ymin = lcl, ymax = ucl),
    position = position_dodge(width = dw), 
    shape = 0,
    show.legend = FALSE
    ) +
  
  geom_line(
    aes(y = mean, group = supp),
    position = position_dodge(width = dw)) +
  
  scale_x_log10() +
  
  labs(
    x = "Dose (mg/day)", 
    y = "Length (\u00b5m)", # unicode \u00b5 is the Greek letter mu
    title = "Odontoblast length vs Vitamin C in Guinea Pigs",
    color = "Delivery Method",
    shape = "Delivery Method") +
  
  theme_bw() +
  theme(legend.position = c(0.8, 0.2),
        legend.background = element_rect(fill = "white",
                                        color = "black"))

Advanced tables

The s data.frame is not quite ideal for creating a table. We will preview an approach called pivoting from the tidyr package.

t <- s %>%
  mutate(ci = paste0(format(mean, digits = 1, nsmall = 1), 
                    " (", 
                    format(lcl,  digits = 1, nsmall = 1),
                    ", ",
                    format(ucl,  digits = 1, nsmall = 1),
                    ")")) %>%
  tidyr::pivot_wider(id_cols = dose, names_from = supp, values_from = ci) %>%
  rename(
    `Orange Juice` = "OJ",
    `Ascorbic Acid` = "VC",
    Dose = dose
  )

t
## # A tibble: 3 × 3
##    Dose `Orange Juice`    `Ascorbic Acid`    
##   <dbl> <chr>             <chr>              
## 1   0.5 13.2 (10.0, 16.4) " 8.0 ( 6.0,  9.9)"
## 2   1   22.7 (19.9, 25.5) "16.8 (15.0, 18.6)"
## 3   2   26.1 (24.2, 28.0) "26.1 (22.7, 29.6)"
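
Since pivoting is only previewed here, a stripped-down sketch may help. It uses a tiny made-up data.frame (the values "a" to "d" are placeholders) and assumes the tidyr package is installed.

```r
library(tidyr)

# A small long-format table: one row per (dose, supp) pair
long <- data.frame(
  dose  = c(0.5, 0.5, 1, 1),
  supp  = c("OJ", "VC", "OJ", "VC"),
  value = c("a", "b", "c", "d")   # placeholder values for illustration
)

# Spread the supp levels into their own columns
wide <- pivot_wider(long, names_from = supp, values_from = value)
wide

# pivot_longer() reverses the operation
back <- pivot_longer(wide, cols = c(OJ, VC),
                     names_to = "supp", values_to = "value")
back
```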

The output above is just the data.frame as printed by R. We can construct better-looking tables using a variety of methods.

library("xtable")
cap <- "Mean odontoblast length (\u00b5m) with 95% CIs."
xt <- xtable::xtable(t, 
             caption=cap,
             align="rr|rr")
print(xt, type = "html", 
      include.rownames = FALSE,
      caption.placement = "top")
Mean odontoblast length (µm) with 95% CIs.
Dose       Orange Juice      Ascorbic Acid
0.50  13.2 (10.0, 16.4)   8.0 ( 6.0,  9.9)
1.00  22.7 (19.9, 25.5)  16.8 (15.0, 18.6)
2.00  26.1 (24.2, 28.0)  26.1 (22.7, 29.6)

Using knitr::kable

library("knitr")
knitr::kable(t, caption = cap, align = "r")
Mean odontoblast length (µm) with 95% CIs.
Dose       Orange Juice      Ascorbic Acid
 0.5  13.2 (10.0, 16.4)   8.0 ( 6.0,  9.9)
 1.0  22.7 (19.9, 25.5)  16.8 (15.0, 18.6)
 2.0  26.1 (24.2, 28.0)  26.1 (22.7, 29.6)