We will start by loading the necessary packages using the library() command.
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The dplyr package provides a number of convenience functions for dealing with data.frame objects in R. As a reminder, a data.frame is a rectangular, matrix-like structure that, unlike a matrix, allows different columns to have different data types.
Although we will focus on the dplyr package, you may want to look up the data.table package. Generally, dplyr aims for readable code, while data.table aims for the most efficient code.
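Strictly speaking, dplyr functions are designed around the tibble, which is a specialized data.frame. When given a plain data.frame, most dplyr functions return a plain data.frame; when given a tibble (or after grouping), they return a tibble.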
?tibble
ToothGrowth
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
## 16 17.3 VC 1.0
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 19 18.8 VC 1.0
## 20 15.5 VC 1.0
## 21 23.6 VC 2.0
## 22 18.5 VC 2.0
## 23 33.9 VC 2.0
## 24 25.5 VC 2.0
## 25 26.4 VC 2.0
## 26 32.5 VC 2.0
## 27 26.7 VC 2.0
## 28 21.5 VC 2.0
## 29 23.3 VC 2.0
## 30 29.5 VC 2.0
## 31 15.2 OJ 0.5
## 32 21.5 OJ 0.5
## 33 17.6 OJ 0.5
## 34 9.7 OJ 0.5
## 35 14.5 OJ 0.5
## 36 10.0 OJ 0.5
## 37 8.2 OJ 0.5
## 38 9.4 OJ 0.5
## 39 16.5 OJ 0.5
## 40 9.7 OJ 0.5
## 41 19.7 OJ 1.0
## 42 23.3 OJ 1.0
## 43 23.6 OJ 1.0
## 44 26.4 OJ 1.0
## 45 20.0 OJ 1.0
## 46 25.2 OJ 1.0
## 47 25.8 OJ 1.0
## 48 21.2 OJ 1.0
## 49 14.5 OJ 1.0
## 50 27.3 OJ 1.0
## 51 25.5 OJ 2.0
## 52 26.4 OJ 2.0
## 53 22.4 OJ 2.0
## 54 24.5 OJ 2.0
## 55 24.8 OJ 2.0
## 56 30.9 OJ 2.0
## 57 26.4 OJ 2.0
## 58 27.3 OJ 2.0
## 59 29.4 OJ 2.0
## 60 23.0 OJ 2.0
vs
as_tibble(ToothGrowth)
## # A tibble: 60 × 3
## len supp dose
## <dbl> <fct> <dbl>
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7 VC 0.5
## # … with 50 more rows
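To see the difference in class rather than just in printing, a small check like the following can help. This is only a sketch, assuming the dplyr and tibble packages are installed; the object name tg is illustrative and not part of the original material.

```r
library("dplyr")
library("tibble")

# A plain data.frame versus its tibble counterpart
class(ToothGrowth)
tg <- as_tibble(ToothGrowth)   # tg is an illustrative name
class(tg)

# A tibble is still a data.frame, so data.frame code keeps working
is.data.frame(tg)

# dplyr verbs generally preserve the class of their input
class(dplyr::filter(ToothGrowth, supp == "VC"))
class(dplyr::filter(tg, supp == "VC"))
```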
Filtering means keeping only the observations (rows) that satisfy some criterion and removing the rest.
ToothGrowth[ToothGrowth$supp == "VC", ]
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
## 16 17.3 VC 1.0
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 19 18.8 VC 1.0
## 20 15.5 VC 1.0
## 21 23.6 VC 2.0
## 22 18.5 VC 2.0
## 23 33.9 VC 2.0
## 24 25.5 VC 2.0
## 25 26.4 VC 2.0
## 26 32.5 VC 2.0
## 27 26.7 VC 2.0
## 28 21.5 VC 2.0
## 29 23.3 VC 2.0
## 30 29.5 VC 2.0
subset(ToothGrowth, supp == "VC")
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
## 16 17.3 VC 1.0
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 19 18.8 VC 1.0
## 20 15.5 VC 1.0
## 21 23.6 VC 2.0
## 22 18.5 VC 2.0
## 23 33.9 VC 2.0
## 24 25.5 VC 2.0
## 25 26.4 VC 2.0
## 26 32.5 VC 2.0
## 27 26.7 VC 2.0
## 28 21.5 VC 2.0
## 29 23.3 VC 2.0
## 30 29.5 VC 2.0
or
ToothGrowth[ToothGrowth$len < 10, ]
## len supp dose
## 1 4.2 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 34 9.7 OJ 0.5
## 37 8.2 OJ 0.5
## 38 9.4 OJ 0.5
## 40 9.7 OJ 0.5
subset(ToothGrowth, len < 10)
## len supp dose
## 1 4.2 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 34 9.7 OJ 0.5
## 37 8.2 OJ 0.5
## 38 9.4 OJ 0.5
## 40 9.7 OJ 0.5
subset(ToothGrowth, len < 10 & supp == "VC") # using the AND operator &
## len supp dose
## 1 4.2 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
Note that dplyr::filter() masks stats::filter(). We will be explicit here and call the filter function from the dplyr package using the double colon operator.
?`::`
dplyr::filter(ToothGrowth, supp == "VC")
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
## 16 17.3 VC 1.0
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 19 18.8 VC 1.0
## 20 15.5 VC 1.0
## 21 23.6 VC 2.0
## 22 18.5 VC 2.0
## 23 33.9 VC 2.0
## 24 25.5 VC 2.0
## 25 26.4 VC 2.0
## 26 32.5 VC 2.0
## 27 26.7 VC 2.0
## 28 21.5 VC 2.0
## 29 23.3 VC 2.0
## 30 29.5 VC 2.0
dplyr::filter(ToothGrowth, len < 10)
## len supp dose
## 1 4.2 VC 0.5
## 2 7.3 VC 0.5
## 3 5.8 VC 0.5
## 4 6.4 VC 0.5
## 5 5.2 VC 0.5
## 6 7.0 VC 0.5
## 7 9.7 OJ 0.5
## 8 8.2 OJ 0.5
## 9 9.4 OJ 0.5
## 10 9.7 OJ 0.5
dplyr::filter(ToothGrowth, len < 10, supp == "VC")
## len supp dose
## 1 4.2 VC 0.5
## 2 7.3 VC 0.5
## 3 5.8 VC 0.5
## 4 6.4 VC 0.5
## 5 5.2 VC 0.5
## 6 7.0 VC 0.5
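In dplyr::filter(), conditions separated by commas are combined with AND. The sketch below, which assumes dplyr is installed and uses illustrative object names, checks that equivalence and shows how OR and %in% conditions might be written.

```r
library("dplyr")

# Conditions separated by commas are combined with AND
a <- dplyr::filter(ToothGrowth, len < 10, supp == "VC")
b <- dplyr::filter(ToothGrowth, len < 10 & supp == "VC")
all.equal(a, b)

# Use | for OR, and %in% to match any of several values
short_or_high <- dplyr::filter(ToothGrowth, len < 10 | dose == 2)
low_doses     <- dplyr::filter(ToothGrowth, dose %in% c(0.5, 1))
nrow(short_or_high)
nrow(low_doses)
```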
Arranging sorts the rows of a data.frame according to one or more variables.
ToothGrowth[order(ToothGrowth$len), ]
## len supp dose
## 1 4.2 VC 0.5
## 9 5.2 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 10 7.0 VC 0.5
## 3 7.3 VC 0.5
## 37 8.2 OJ 0.5
## 38 9.4 OJ 0.5
## 34 9.7 OJ 0.5
## 40 9.7 OJ 0.5
## 6 10.0 VC 0.5
## 36 10.0 OJ 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 2 11.5 VC 0.5
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 35 14.5 OJ 0.5
## 49 14.5 OJ 1.0
## 13 15.2 VC 1.0
## 31 15.2 OJ 0.5
## 20 15.5 VC 1.0
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 39 16.5 OJ 0.5
## 14 17.3 VC 1.0
## 16 17.3 VC 1.0
## 33 17.6 OJ 0.5
## 22 18.5 VC 2.0
## 19 18.8 VC 1.0
## 41 19.7 OJ 1.0
## 45 20.0 OJ 1.0
## 48 21.2 OJ 1.0
## 28 21.5 VC 2.0
## 32 21.5 OJ 0.5
## 53 22.4 OJ 2.0
## 15 22.5 VC 1.0
## 60 23.0 OJ 2.0
## 29 23.3 VC 2.0
## 42 23.3 OJ 1.0
## 21 23.6 VC 2.0
## 43 23.6 OJ 1.0
## 54 24.5 OJ 2.0
## 55 24.8 OJ 2.0
## 46 25.2 OJ 1.0
## 24 25.5 VC 2.0
## 51 25.5 OJ 2.0
## 47 25.8 OJ 1.0
## 25 26.4 VC 2.0
## 44 26.4 OJ 1.0
## 52 26.4 OJ 2.0
## 57 26.4 OJ 2.0
## 27 26.7 VC 2.0
## 50 27.3 OJ 1.0
## 58 27.3 OJ 2.0
## 59 29.4 OJ 2.0
## 30 29.5 VC 2.0
## 56 30.9 OJ 2.0
## 26 32.5 VC 2.0
## 23 33.9 VC 2.0
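What happens if we want to arrange by supp and then by dose? In base R, order() accepts several sort keys, but the call quickly becomes verbose. With dplyr, we simply list the variables in arrange().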
arrange(ToothGrowth, len)
## len supp dose
## 1 4.2 VC 0.5
## 2 5.2 VC 0.5
## 3 5.8 VC 0.5
## 4 6.4 VC 0.5
## 5 7.0 VC 0.5
## 6 7.3 VC 0.5
## 7 8.2 OJ 0.5
## 8 9.4 OJ 0.5
## 9 9.7 OJ 0.5
## 10 9.7 OJ 0.5
## 11 10.0 VC 0.5
## 12 10.0 OJ 0.5
## 13 11.2 VC 0.5
## 14 11.2 VC 0.5
## 15 11.5 VC 0.5
## 16 13.6 VC 1.0
## 17 14.5 VC 1.0
## 18 14.5 OJ 0.5
## 19 14.5 OJ 1.0
## 20 15.2 VC 1.0
## 21 15.2 OJ 0.5
## 22 15.5 VC 1.0
## 23 16.5 VC 1.0
## 24 16.5 VC 1.0
## 25 16.5 OJ 0.5
## 26 17.3 VC 1.0
## 27 17.3 VC 1.0
## 28 17.6 OJ 0.5
## 29 18.5 VC 2.0
## 30 18.8 VC 1.0
## 31 19.7 OJ 1.0
## 32 20.0 OJ 1.0
## 33 21.2 OJ 1.0
## 34 21.5 VC 2.0
## 35 21.5 OJ 0.5
## 36 22.4 OJ 2.0
## 37 22.5 VC 1.0
## 38 23.0 OJ 2.0
## 39 23.3 VC 2.0
## 40 23.3 OJ 1.0
## 41 23.6 VC 2.0
## 42 23.6 OJ 1.0
## 43 24.5 OJ 2.0
## 44 24.8 OJ 2.0
## 45 25.2 OJ 1.0
## 46 25.5 VC 2.0
## 47 25.5 OJ 2.0
## 48 25.8 OJ 1.0
## 49 26.4 VC 2.0
## 50 26.4 OJ 1.0
## 51 26.4 OJ 2.0
## 52 26.4 OJ 2.0
## 53 26.7 VC 2.0
## 54 27.3 OJ 1.0
## 55 27.3 OJ 2.0
## 56 29.4 OJ 2.0
## 57 29.5 VC 2.0
## 58 30.9 OJ 2.0
## 59 32.5 VC 2.0
## 60 33.9 VC 2.0
arrange(ToothGrowth, supp, dose)
## len supp dose
## 1 15.2 OJ 0.5
## 2 21.5 OJ 0.5
## 3 17.6 OJ 0.5
## 4 9.7 OJ 0.5
## 5 14.5 OJ 0.5
## 6 10.0 OJ 0.5
## 7 8.2 OJ 0.5
## 8 9.4 OJ 0.5
## 9 16.5 OJ 0.5
## 10 9.7 OJ 0.5
## 11 19.7 OJ 1.0
## 12 23.3 OJ 1.0
## 13 23.6 OJ 1.0
## 14 26.4 OJ 1.0
## 15 20.0 OJ 1.0
## 16 25.2 OJ 1.0
## 17 25.8 OJ 1.0
## 18 21.2 OJ 1.0
## 19 14.5 OJ 1.0
## 20 27.3 OJ 1.0
## 21 25.5 OJ 2.0
## 22 26.4 OJ 2.0
## 23 22.4 OJ 2.0
## 24 24.5 OJ 2.0
## 25 24.8 OJ 2.0
## 26 30.9 OJ 2.0
## 27 26.4 OJ 2.0
## 28 27.3 OJ 2.0
## 29 29.4 OJ 2.0
## 30 23.0 OJ 2.0
## 31 4.2 VC 0.5
## 32 11.5 VC 0.5
## 33 7.3 VC 0.5
## 34 5.8 VC 0.5
## 35 6.4 VC 0.5
## 36 10.0 VC 0.5
## 37 11.2 VC 0.5
## 38 11.2 VC 0.5
## 39 5.2 VC 0.5
## 40 7.0 VC 0.5
## 41 16.5 VC 1.0
## 42 16.5 VC 1.0
## 43 15.2 VC 1.0
## 44 17.3 VC 1.0
## 45 22.5 VC 1.0
## 46 17.3 VC 1.0
## 47 13.6 VC 1.0
## 48 14.5 VC 1.0
## 49 18.8 VC 1.0
## 50 15.5 VC 1.0
## 51 23.6 VC 2.0
## 52 18.5 VC 2.0
## 53 33.9 VC 2.0
## 54 25.5 VC 2.0
## 55 26.4 VC 2.0
## 56 32.5 VC 2.0
## 57 26.7 VC 2.0
## 58 21.5 VC 2.0
## 59 23.3 VC 2.0
## 60 29.5 VC 2.0
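For comparison, sorting by several variables in base R might look like the sketch below. It assumes dplyr is installed; the object names are illustrative. Row names are reset in the base R result because dplyr does not keep the original ones.

```r
library("dplyr")

# Base R: order() accepts several sort keys
base_sorted <- ToothGrowth[order(ToothGrowth$supp, ToothGrowth$dose), ]
rownames(base_sorted) <- NULL   # drop original row names to match dplyr

dplyr_sorted <- arrange(ToothGrowth, supp, dose)
all.equal(base_sorted, dplyr_sorted)

# desc() reverses the sort direction for a variable
longest_first <- arrange(ToothGrowth, desc(len))
head(longest_first, 3)
```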
Here we will select which columns to keep.
ToothGrowth[, c("len","dose")]
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
ToothGrowth[, c(1,3)]
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
ToothGrowth[, -2]
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
subset(ToothGrowth, select = c(len, dose))
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
dplyr::select(ToothGrowth, len, dose)
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
dplyr::select(ToothGrowth, c(1,3))
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
dplyr::select(ToothGrowth, -2)
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
dplyr::select(ToothGrowth, -supp)
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
When data.frames get large, it is annoying to type all the variables you want to keep. Often we want to keep a collection of variables whose names share certain properties.
select(ToothGrowth, starts_with("len"))
## len
## 1 4.2
## 2 11.5
## 3 7.3
## 4 5.8
## 5 6.4
## 6 10.0
## 7 11.2
## 8 11.2
## 9 5.2
## 10 7.0
## 11 16.5
## 12 16.5
## 13 15.2
## 14 17.3
## 15 22.5
## 16 17.3
## 17 13.6
## 18 14.5
## 19 18.8
## 20 15.5
## 21 23.6
## 22 18.5
## 23 33.9
## 24 25.5
## 25 26.4
## 26 32.5
## 27 26.7
## 28 21.5
## 29 23.3
## 30 29.5
## 31 15.2
## 32 21.5
## 33 17.6
## 34 9.7
## 35 14.5
## 36 10.0
## 37 8.2
## 38 9.4
## 39 16.5
## 40 9.7
## 41 19.7
## 42 23.3
## 43 23.6
## 44 26.4
## 45 20.0
## 46 25.2
## 47 25.8
## 48 21.2
## 49 14.5
## 50 27.3
## 51 25.5
## 52 26.4
## 53 22.4
## 54 24.5
## 55 24.8
## 56 30.9
## 57 26.4
## 58 27.3
## 59 29.4
## 60 23.0
There are a variety of these helper functions; see
?starts_with
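A few of the other selection helpers, applied to ToothGrowth, might look like this. The sketch assumes dplyr is installed; wrapping each call in names() just keeps the output short.

```r
library("dplyr")

names(select(ToothGrowth, ends_with("e")))      # column names ending in "e"
names(select(ToothGrowth, contains("s")))       # column names containing "s"
names(select(ToothGrowth, where(is.numeric)))   # columns selected by type
```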
Many times variable names need to be modified.
d <- ToothGrowth
names(d)
## [1] "len" "supp" "dose"
names(d) <- c("Length","Delivery","Dose")
head(d)
## Length Delivery Dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
For the rules on valid variable names, see
?make.names
names(d)[2] <- "Delivery Method"
names(d)
## [1] "Length" "Delivery Method" "Dose"
This can get you into trouble
d$Delivery Method
## Error: <text>:1:12: unexpected symbol
## 1: d$Delivery Method
## ^
but you can use “bad” variable names by enclosing them in backticks
d$`Delivery Method`
## [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
## [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC
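If you would rather convert names into syntactically valid ones, base R's make.names() can do that. The following is a small sketch; d2 is an illustrative copy and not an object used elsewhere in these notes.

```r
# make.names() replaces invalid characters with "."
make.names("Delivery Method")
make.names(c("Length (mm)", "Dose"))

# Applied to a whole data.frame (d2 is an illustrative copy)
d2 <- ToothGrowth
names(d2) <- c("Length", "Delivery Method", "Dose")
names(d2) <- make.names(names(d2))
names(d2)

# The cleaned name now works with $ without backticks
d2$Delivery.Method[1:3]
```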
d <- ToothGrowth
names(d)
## [1] "len" "supp" "dose"
d <- rename(d,
            Length = len,
            `Delivery Method` = supp,
            Dose = dose
)
names(d)
## [1] "Length" "Delivery Method" "Dose"
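When many columns need the same kind of change, dplyr's rename_with() applies a function to the names instead of listing each one. A minimal sketch, assuming dplyr version 1.0.0 or later is installed:

```r
library("dplyr")

# rename_with() applies a function to the column names
names(rename_with(ToothGrowth, toupper))

# Restrict the renaming to selected columns via .cols
names(rename_with(ToothGrowth, toupper, .cols = c(len, dose)))
```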
We will often want to create new variables or modify existing variables.
We have already seen the extract operator, $, used to pull a column out of a data.frame. Here we will use it to create a new variable.
d <- ToothGrowth
d$`Length (mm)` <- d$len / 1000 # original length was in microns
head(d)
mutate(ToothGrowth, `Length (mm)` = len)
## len supp dose Length (mm)
## 1 4.2 VC 0.5 4.2
## 2 11.5 VC 0.5 11.5
## 3 7.3 VC 0.5 7.3
## 4 5.8 VC 0.5 5.8
## 5 6.4 VC 0.5 6.4
## 6 10.0 VC 0.5 10.0
## 7 11.2 VC 0.5 11.2
## 8 11.2 VC 0.5 11.2
## 9 5.2 VC 0.5 5.2
## 10 7.0 VC 0.5 7.0
## 11 16.5 VC 1.0 16.5
## 12 16.5 VC 1.0 16.5
## 13 15.2 VC 1.0 15.2
## 14 17.3 VC 1.0 17.3
## 15 22.5 VC 1.0 22.5
## 16 17.3 VC 1.0 17.3
## 17 13.6 VC 1.0 13.6
## 18 14.5 VC 1.0 14.5
## 19 18.8 VC 1.0 18.8
## 20 15.5 VC 1.0 15.5
## 21 23.6 VC 2.0 23.6
## 22 18.5 VC 2.0 18.5
## 23 33.9 VC 2.0 33.9
## 24 25.5 VC 2.0 25.5
## 25 26.4 VC 2.0 26.4
## 26 32.5 VC 2.0 32.5
## 27 26.7 VC 2.0 26.7
## 28 21.5 VC 2.0 21.5
## 29 23.3 VC 2.0 23.3
## 30 29.5 VC 2.0 29.5
## 31 15.2 OJ 0.5 15.2
## 32 21.5 OJ 0.5 21.5
## 33 17.6 OJ 0.5 17.6
## 34 9.7 OJ 0.5 9.7
## 35 14.5 OJ 0.5 14.5
## 36 10.0 OJ 0.5 10.0
## 37 8.2 OJ 0.5 8.2
## 38 9.4 OJ 0.5 9.4
## 39 16.5 OJ 0.5 16.5
## 40 9.7 OJ 0.5 9.7
## 41 19.7 OJ 1.0 19.7
## 42 23.3 OJ 1.0 23.3
## 43 23.6 OJ 1.0 23.6
## 44 26.4 OJ 1.0 26.4
## 45 20.0 OJ 1.0 20.0
## 46 25.2 OJ 1.0 25.2
## 47 25.8 OJ 1.0 25.8
## 48 21.2 OJ 1.0 21.2
## 49 14.5 OJ 1.0 14.5
## 50 27.3 OJ 1.0 27.3
## 51 25.5 OJ 2.0 25.5
## 52 26.4 OJ 2.0 26.4
## 53 22.4 OJ 2.0 22.4
## 54 24.5 OJ 2.0 24.5
## 55 24.8 OJ 2.0 24.8
## 56 30.9 OJ 2.0 30.9
## 57 26.4 OJ 2.0 26.4
## 58 27.3 OJ 2.0 27.3
## 59 29.4 OJ 2.0 29.4
## 60 23.0 OJ 2.0 23.0
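The mutate() call above copies len unchanged. A sketch that mirrors the base R conversion (dividing by 1000, on the text's assumption that len is recorded in microns) and creates a second variable in the same call could look like this. The column name high_dose is illustrative only.

```r
library("dplyr")

d <- mutate(ToothGrowth,
            `Length (mm)` = len / 1000,   # same conversion as the $ version
            high_dose = dose >= 2)        # several variables can be created at once
head(d)
```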
Often we will want to create summaries of our data.
mean(ToothGrowth$len)
## [1] 18.81333
table(ToothGrowth$supp)
##
## OJ VC
## 30 30
dplyr::summarize(ToothGrowth, mean_len = mean(len))
## mean_len
## 1 18.81333
dplyr::summarize(
  ToothGrowth,
  n_VC = sum(supp == "VC"),
  n_OJ = sum(supp == "OJ")
)
## n_VC n_OJ
## 1 30 30
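For counts specifically, dplyr also offers count() as a counterpart to table(). The sketch below assumes dplyr is installed; the object names are illustrative.

```r
library("dplyr")

# count() returns a data.frame with one row per value and a column n
by_supp <- count(ToothGrowth, supp)
by_supp

# Counting combinations of two variables
by_supp_dose <- count(ToothGrowth, supp, dose)
by_supp_dose
```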
According to IBM,
A data pipeline is a method in which raw data is ingested from various data sources and then ported to data store, like a data lake or data warehouse, for analysis. Before data flows into a data repository, it usually undergoes some data processing. This is inclusive of data transformations, such as filtering, masking, and aggregations, which ensure appropriate data integration and standardization.
The pipe operator is a simple function that takes everything on the left-hand side (lhs) of the pipe and “pipes” it as the first argument to the function on the right-hand side.
16 %>% sqrt()
## [1] 4
This becomes especially useful when used in conjunction with additional pipe operators.
256 %>% sqrt() %>% sqrt()
## [1] 4
The original pipe operator, %>%, has moved around a bit, but it is now defined in the magrittr package and re-exported by dplyr. As of version 4.1.0, base R also has its own pipe operator, |>.
16 |> sqrt()
## [1] 4
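To make the "first argument" rule concrete, here is a short sketch using both pipes. It assumes the magrittr package is installed and R is version 4.1.0 or later; x is an illustrative vector.

```r
library("magrittr")

# The left-hand side becomes the first argument on the right-hand side
x <- c(1, 4, 9)
x %>% sqrt() %>% sum()   # same as sum(sqrt(x))
x |> sqrt() |> sum()     # base R pipe

# Further arguments are supplied as usual
256 %>% log(base = 2)    # same as log(256, base = 2)
```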
Here is a brief history of the pipe operator in R.
ToothGrowth %>%
  filter(dose == 0.5, supp == "VC") %>%
  summarize(
    n = n(),
    mean = mean(len),
    sd = sd(len))
## n mean sd
## 1 10 7.98 2.746634
Often, we want to split our data up, apply some operations to each piece, and then combine the pieces back together. In the ToothGrowth data.frame, we may want to calculate some summary statistics for each combination of delivery method and dose.
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(
    n = n(),
    mean = mean(len),
    sd = sd(len),
    .groups = "drop"
  )
## # A tibble: 6 × 5
## supp dose n mean sd
## <fct> <dbl> <int> <dbl> <dbl>
## 1 OJ 0.5 10 13.2 4.46
## 2 OJ 1 10 22.7 3.91
## 3 OJ 2 10 26.1 2.66
## 4 VC 0.5 10 7.98 2.75
## 5 VC 1 10 16.8 2.52
## 6 VC 2 10 26.1 4.80
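The split, apply, and combine steps that group_by() and summarize() perform can be spelled out in base R. The sketch below uses only base functions and illustrative object names, and for brevity it groups by delivery method only in the step-by-step part.

```r
# Split: one vector of lengths per delivery method
pieces <- split(ToothGrowth$len, ToothGrowth$supp)

# Apply: compute a summary on each piece
piece_means <- lapply(pieces, mean)

# Combine: put the results back together
by_supp_means <- unlist(piece_means)
by_supp_means

# aggregate() performs the three steps in one call
by_supp_dose <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
by_supp_dose
```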
First, construct the necessary data.frame:
s <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(
    n = n(),
    mean = mean(len),
    sd = sd(len),
    .groups = "drop"
  ) %>%
  mutate(
    ucl = mean + qt(.975, n-1)*sd/sqrt(n),
    lcl = mean - qt(.975, n-1)*sd/sqrt(n)
  )
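The ucl and lcl columns are the limits of a 95% confidence interval for each group mean, based on the t distribution with n-1 degrees of freedom. As a check, the sketch below computes the interval by hand for one group and compares it with the interval from t.test(). The object names are illustrative.

```r
# Lengths for one group: ascorbic acid (VC) at dose 0.5
y <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]
n <- length(y)

# Interval computed by hand, using the same formula as the mutate() call
half_width <- qt(0.975, n - 1) * sd(y) / sqrt(n)
by_hand <- c(mean(y) - half_width, mean(y) + half_width)

# The same interval from a one-sample t-test
from_t_test <- as.numeric(t.test(y)$conf.int)

all.equal(by_hand, from_t_test)
```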
Then construct the plot
dw <- 0.1
ggplot(s,
       aes(x = dose, color = supp, shape = supp)) +
  geom_point(
    data = ToothGrowth,
    aes(y = len),
    position = position_jitterdodge(dodge.width = dw, jitter.width = 0.05)) +
  geom_pointrange(
    data = s,
    aes(y = mean, ymin = lcl, ymax = ucl),
    position = position_dodge(width = dw),
    shape = 0,
    show.legend = FALSE
  ) +
  geom_line(
    aes(y = mean, group = supp),
    position = position_dodge(width = dw)) +
  scale_x_log10() +
  labs(
    x = "Dose (mg/day)",
    y = "Length (\u00b5m)", # unicode \u00b5 is the Greek letter mu
    title = "Odontoblast length vs Vitamin C in Guinea Pigs",
    color = "Delivery Method",
    shape = "Delivery Method") +
  theme_bw() +
  theme(legend.position = c(0.8, 0.2),
        legend.background = element_rect(fill = "white",
                                         color = "black"))
The s data.frame is not quite ideal for creating a table. We will preview an approach called pivoting from the tidyr package.
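As a toy illustration of pivoting, the sketch below reshapes a small long-format data.frame so that each delivery method becomes its own column. It assumes tidyr is installed; the object names long and wide are illustrative, and the values are rounded group means for two doses.

```r
library("tidyr")

# A small long-format data.frame: one row per dose and delivery method
long <- data.frame(
  dose = c(0.5, 0.5, 1, 1),
  supp = c("OJ", "VC", "OJ", "VC"),
  mean = c(13.2, 8.0, 22.7, 16.8)
)

# One row per dose, one column per delivery method
wide <- pivot_wider(long, names_from = supp, values_from = mean)
wide
```

The same idea is applied to the s data.frame, with a formatted confidence interval string in place of the mean.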
t <- s %>%
  mutate(ci = paste0(format(mean, digits = 1, nsmall = 1),
                     " (",
                     format(lcl, digits = 1, nsmall = 1),
                     ", ",
                     format(ucl, digits = 1, nsmall = 1),
                     ")")) %>%
  tidyr::pivot_wider(id_cols = dose, names_from = supp, values_from = ci) %>%
  rename(
    `Orange Juice` = "OJ",
    `Ascorbic Acid` = "VC",
    Dose = dose
  )
t
## # A tibble: 3 × 3
## Dose `Orange Juice` `Ascorbic Acid`
## <dbl> <chr> <chr>
## 1 0.5 13.2 (10.0, 16.4) " 8.0 ( 6.0, 9.9)"
## 2 1 22.7 (19.9, 25.5) "16.8 (15.0, 18.6)"
## 3 2 26.1 (24.2, 28.0) "26.1 (22.7, 29.6)"
The output above is just how the tibble prints in R. We can construct better-looking tables using a variety of methods.
library("xtable")
cap <- "Mean odontoblast length (\u00b5m) with 95% CIs."
xt <- xtable::xtable(t,
                     caption = cap,
                     align = "rr|rr")
print(xt, type = "html",
      include.rownames = FALSE,
      caption.placement = "top")
Dose | Orange Juice | Ascorbic Acid |
---|---|---|
0.50 | 13.2 (10.0, 16.4) | 8.0 ( 6.0, 9.9) |
1.00 | 22.7 (19.9, 25.5) | 16.8 (15.0, 18.6) |
2.00 | 26.1 (24.2, 28.0) | 26.1 (22.7, 29.6) |
Using knitr::kable
library("knitr")
knitr::kable(t, caption = cap, align = "r")
Dose | Orange Juice | Ascorbic Acid |
---|---|---|
0.5 | 13.2 (10.0, 16.4) | 8.0 ( 6.0, 9.9) |
1.0 | 22.7 (19.9, 25.5) | 16.8 (15.0, 18.6) |
2.0 | 26.1 (24.2, 28.0) | 26.1 (22.7, 29.6) |