We will start with loading up the necessary packages using the
library()
command.
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Before, we can start reading data in from our file system, we need to 1) understand the concept of a working directory and 2) have a file that we can read.
getwd()
## [1] "C:/Users/niemi/Documents/git/jarad.github.com/courses/stat486/slides/07-wrangling"
You can set the working directory in RStudio by going to
Session > Set Working Directory > Choose Working Directory
Alternatively, we can use the setwd()
function.
?setwd
Use RStudio projects!! This will set your working directory to the same folder every time you open that project.
In order to read a file, we need an appropriate file. Comma-separated value (csv) files are a de-facto standard for transporting data. In this file type, each column is separated by a comma and each row is separated by a newline.
readr::write_csv(ToothGrowth, file = "ToothGrowth.csv")
d <- readr::read_csv("ToothGrowth.csv")
## Rows: 60 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): supp
## dbl (2): len, dose
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all.equal(ToothGrowth, d)
## [1] "Attributes: < Names: 1 string mismatch >"
## [2] "Attributes: < Length mismatch: comparison on first 2 components >"
## [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
## [4] "Attributes: < Component \"class\": 1 string mismatch >"
## [5] "Attributes: < Component 2: Modes: numeric, externalptr >"
## [6] "Attributes: < Component 2: Lengths: 60, 1 >"
## [7] "Attributes: < Component 2: target is numeric, current is externalptr >"
## [8] "Component \"supp\": 'current' is not a factor"
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
str(d)
## spc_tbl_ [60 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ len : num [1:60] 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: chr [1:60] "VC" "VC" "VC" "VC" ...
## $ dose: num [1:60] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## - attr(*, "spec")=
## .. cols(
## .. len = col_double(),
## .. supp = col_character(),
## .. dose = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
d <- readr::read_csv("ToothGrowth.csv",
col_types = cols(
len = col_double(),
supp = col_factor(),
dose = col_double()
))
Check out
install.packages("readxl")
?readxl::read_excel
To read other types of files, including databases, start with these suggestions in R4DS.
There is no official standard for CSV file formats, but there are attempts to define a standard or at least outline what is standard.
The file format only suggests an (optional) header row. Otherwise all rows should strictly be data.
Some (Canvas….ahem!) add additional row(s) between the header and the data. This should not be done as it makes it difficult for standard software including R to read it since data is assumed to immediately come after the header.
If you want to include additional rows, you can include rows above
the header row. See the skip
argument of
readr::read_csv()
.
A common use of spreadsheets is to calculate subtotals either by adding additional rows or columns. Although these can be removed, it is annoying and therefore should be avoided. Simply calculate subtotals in a different Excel sheet or copy the data over to a new Excel sheet.
Data often come in multiple files and need to be wrangled to get it into the format necessary for analysis.
A common issue is that data are created on a periodic basis and each period has its own file. In these circumstances, you need to bind the data together.
dA <- ToothGrowth %>% filter(supp == "VC")
dB <- ToothGrowth %>% filter(supp == "OJ")
bind_rows(dA, dB)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
## 16 17.3 VC 1.0
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 19 18.8 VC 1.0
## 20 15.5 VC 1.0
## 21 23.6 VC 2.0
## 22 18.5 VC 2.0
## 23 33.9 VC 2.0
## 24 25.5 VC 2.0
## 25 26.4 VC 2.0
## 26 32.5 VC 2.0
## 27 26.7 VC 2.0
## 28 21.5 VC 2.0
## 29 23.3 VC 2.0
## 30 29.5 VC 2.0
## 31 15.2 OJ 0.5
## 32 21.5 OJ 0.5
## 33 17.6 OJ 0.5
## 34 9.7 OJ 0.5
## 35 14.5 OJ 0.5
## 36 10.0 OJ 0.5
## 37 8.2 OJ 0.5
## 38 9.4 OJ 0.5
## 39 16.5 OJ 0.5
## 40 9.7 OJ 0.5
## 41 19.7 OJ 1.0
## 42 23.3 OJ 1.0
## 43 23.6 OJ 1.0
## 44 26.4 OJ 1.0
## 45 20.0 OJ 1.0
## 46 25.2 OJ 1.0
## 47 25.8 OJ 1.0
## 48 21.2 OJ 1.0
## 49 14.5 OJ 1.0
## 50 27.3 OJ 1.0
## 51 25.5 OJ 2.0
## 52 26.4 OJ 2.0
## 53 22.4 OJ 2.0
## 54 24.5 OJ 2.0
## 55 24.8 OJ 2.0
## 56 30.9 OJ 2.0
## 57 26.4 OJ 2.0
## 58 27.3 OJ 2.0
## 59 29.4 OJ 2.0
## 60 23.0 OJ 2.0
d <- list(
A = ToothGrowth %>% filter(supp == "VC", dose == 0.5),
B = ToothGrowth %>% filter(supp == "VC", dose == 1),
C = ToothGrowth %>% filter(supp == "VC", dose == 2),
D = ToothGrowth %>% filter(supp == "OJ", dose == 0.5),
E = ToothGrowth %>% filter(supp == "OJ", dose == 1),
F = ToothGrowth %>% filter(supp == "OJ", dose == 2)
)
class(d)
## [1] "list"
d[[1]]
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
d$B
## len supp dose
## 1 16.5 VC 1
## 2 16.5 VC 1
## 3 15.2 VC 1
## 4 17.3 VC 1
## 5 22.5 VC 1
## 6 17.3 VC 1
## 7 13.6 VC 1
## 8 14.5 VC 1
## 9 18.8 VC 1
## 10 15.5 VC 1
bind_rows(d)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
## 11 16.5 VC 1.0
## 12 16.5 VC 1.0
## 13 15.2 VC 1.0
## 14 17.3 VC 1.0
## 15 22.5 VC 1.0
## 16 17.3 VC 1.0
## 17 13.6 VC 1.0
## 18 14.5 VC 1.0
## 19 18.8 VC 1.0
## 20 15.5 VC 1.0
## 21 23.6 VC 2.0
## 22 18.5 VC 2.0
## 23 33.9 VC 2.0
## 24 25.5 VC 2.0
## 25 26.4 VC 2.0
## 26 32.5 VC 2.0
## 27 26.7 VC 2.0
## 28 21.5 VC 2.0
## 29 23.3 VC 2.0
## 30 29.5 VC 2.0
## 31 15.2 OJ 0.5
## 32 21.5 OJ 0.5
## 33 17.6 OJ 0.5
## 34 9.7 OJ 0.5
## 35 14.5 OJ 0.5
## 36 10.0 OJ 0.5
## 37 8.2 OJ 0.5
## 38 9.4 OJ 0.5
## 39 16.5 OJ 0.5
## 40 9.7 OJ 0.5
## 41 19.7 OJ 1.0
## 42 23.3 OJ 1.0
## 43 23.6 OJ 1.0
## 44 26.4 OJ 1.0
## 45 20.0 OJ 1.0
## 46 25.2 OJ 1.0
## 47 25.8 OJ 1.0
## 48 21.2 OJ 1.0
## 49 14.5 OJ 1.0
## 50 27.3 OJ 1.0
## 51 25.5 OJ 2.0
## 52 26.4 OJ 2.0
## 53 22.4 OJ 2.0
## 54 24.5 OJ 2.0
## 55 24.8 OJ 2.0
## 56 30.9 OJ 2.0
## 57 26.4 OJ 2.0
## 58 27.3 OJ 2.0
## 59 29.4 OJ 2.0
## 60 23.0 OJ 2.0
Hmm…maybe don’t bother??
Data sets are often in multiple files with each file contain some of the information.
The most common join operation I use is left_join()
.
Rather than refactoring the supp
column in
ToothGrowth
, I would probably just create a new tibble that
contains the necessary information.
suppTranslation <- tibble(
supp = c("VC","OJ"),
`Delivery Method` = c("Ascorbic Acid","Orange Juice")
)
Now, I can simply left_join()
these two data sets:
left_join(ToothGrowth, suppTranslation, by = "supp")
## len supp dose Delivery Method
## 1 4.2 VC 0.5 Ascorbic Acid
## 2 11.5 VC 0.5 Ascorbic Acid
## 3 7.3 VC 0.5 Ascorbic Acid
## 4 5.8 VC 0.5 Ascorbic Acid
## 5 6.4 VC 0.5 Ascorbic Acid
## 6 10.0 VC 0.5 Ascorbic Acid
## 7 11.2 VC 0.5 Ascorbic Acid
## 8 11.2 VC 0.5 Ascorbic Acid
## 9 5.2 VC 0.5 Ascorbic Acid
## 10 7.0 VC 0.5 Ascorbic Acid
## 11 16.5 VC 1.0 Ascorbic Acid
## 12 16.5 VC 1.0 Ascorbic Acid
## 13 15.2 VC 1.0 Ascorbic Acid
## 14 17.3 VC 1.0 Ascorbic Acid
## 15 22.5 VC 1.0 Ascorbic Acid
## 16 17.3 VC 1.0 Ascorbic Acid
## 17 13.6 VC 1.0 Ascorbic Acid
## 18 14.5 VC 1.0 Ascorbic Acid
## 19 18.8 VC 1.0 Ascorbic Acid
## 20 15.5 VC 1.0 Ascorbic Acid
## 21 23.6 VC 2.0 Ascorbic Acid
## 22 18.5 VC 2.0 Ascorbic Acid
## 23 33.9 VC 2.0 Ascorbic Acid
## 24 25.5 VC 2.0 Ascorbic Acid
## 25 26.4 VC 2.0 Ascorbic Acid
## 26 32.5 VC 2.0 Ascorbic Acid
## 27 26.7 VC 2.0 Ascorbic Acid
## 28 21.5 VC 2.0 Ascorbic Acid
## 29 23.3 VC 2.0 Ascorbic Acid
## 30 29.5 VC 2.0 Ascorbic Acid
## 31 15.2 OJ 0.5 Orange Juice
## 32 21.5 OJ 0.5 Orange Juice
## 33 17.6 OJ 0.5 Orange Juice
## 34 9.7 OJ 0.5 Orange Juice
## 35 14.5 OJ 0.5 Orange Juice
## 36 10.0 OJ 0.5 Orange Juice
## 37 8.2 OJ 0.5 Orange Juice
## 38 9.4 OJ 0.5 Orange Juice
## 39 16.5 OJ 0.5 Orange Juice
## 40 9.7 OJ 0.5 Orange Juice
## 41 19.7 OJ 1.0 Orange Juice
## 42 23.3 OJ 1.0 Orange Juice
## 43 23.6 OJ 1.0 Orange Juice
## 44 26.4 OJ 1.0 Orange Juice
## 45 20.0 OJ 1.0 Orange Juice
## 46 25.2 OJ 1.0 Orange Juice
## 47 25.8 OJ 1.0 Orange Juice
## 48 21.2 OJ 1.0 Orange Juice
## 49 14.5 OJ 1.0 Orange Juice
## 50 27.3 OJ 1.0 Orange Juice
## 51 25.5 OJ 2.0 Orange Juice
## 52 26.4 OJ 2.0 Orange Juice
## 53 22.4 OJ 2.0 Orange Juice
## 54 24.5 OJ 2.0 Orange Juice
## 55 24.8 OJ 2.0 Orange Juice
## 56 30.9 OJ 2.0 Orange Juice
## 57 26.4 OJ 2.0 Orange Juice
## 58 27.3 OJ 2.0 Orange Juice
## 59 29.4 OJ 2.0 Orange Juice
## 60 23.0 OJ 2.0 Orange Juice
Of course, if the tibbles were in a different order (perhaps due to a
data pipeline), I could use right_join()
.
right_join(suppTranslation, ToothGrowth, by = "supp")
## # A tibble: 60 × 4
## supp `Delivery Method` len dose
## <chr> <chr> <dbl> <dbl>
## 1 VC Ascorbic Acid 4.2 0.5
## 2 VC Ascorbic Acid 11.5 0.5
## 3 VC Ascorbic Acid 7.3 0.5
## 4 VC Ascorbic Acid 5.8 0.5
## 5 VC Ascorbic Acid 6.4 0.5
## 6 VC Ascorbic Acid 10 0.5
## 7 VC Ascorbic Acid 11.2 0.5
## 8 VC Ascorbic Acid 11.2 0.5
## 9 VC Ascorbic Acid 5.2 0.5
## 10 VC Ascorbic Acid 7 0.5
## # … with 50 more rows
One way to filter observations is to use the
dplyr::filter()
function. If the observations you want to
remove are already part of a data.frame, then you can use the
anti_join()
function.
len10 <- filter(ToothGrowth, len < 10)
anti_join(ToothGrowth, len10, by = c("len", "supp", "dose"))
## len supp dose
## 1 11.5 VC 0.5
## 2 10.0 VC 0.5
## 3 11.2 VC 0.5
## 4 11.2 VC 0.5
## 5 16.5 VC 1.0
## 6 16.5 VC 1.0
## 7 15.2 VC 1.0
## 8 17.3 VC 1.0
## 9 22.5 VC 1.0
## 10 17.3 VC 1.0
## 11 13.6 VC 1.0
## 12 14.5 VC 1.0
## 13 18.8 VC 1.0
## 14 15.5 VC 1.0
## 15 23.6 VC 2.0
## 16 18.5 VC 2.0
## 17 33.9 VC 2.0
## 18 25.5 VC 2.0
## 19 26.4 VC 2.0
## 20 32.5 VC 2.0
## 21 26.7 VC 2.0
## 22 21.5 VC 2.0
## 23 23.3 VC 2.0
## 24 29.5 VC 2.0
## 25 15.2 OJ 0.5
## 26 21.5 OJ 0.5
## 27 17.6 OJ 0.5
## 28 14.5 OJ 0.5
## 29 10.0 OJ 0.5
## 30 16.5 OJ 0.5
## 31 19.7 OJ 1.0
## 32 23.3 OJ 1.0
## 33 23.6 OJ 1.0
## 34 26.4 OJ 1.0
## 35 20.0 OJ 1.0
## 36 25.2 OJ 1.0
## 37 25.8 OJ 1.0
## 38 21.2 OJ 1.0
## 39 14.5 OJ 1.0
## 40 27.3 OJ 1.0
## 41 25.5 OJ 2.0
## 42 26.4 OJ 2.0
## 43 22.4 OJ 2.0
## 44 24.5 OJ 2.0
## 45 24.8 OJ 2.0
## 46 30.9 OJ 2.0
## 47 26.4 OJ 2.0
## 48 27.3 OJ 2.0
## 49 29.4 OJ 2.0
## 50 23.0 OJ 2.0
There are also inner_join()
, full_join()
,
semi_join()
, and nest_join()
. I don’t use
these, but probably I should.
Another very common need is to convert data from a relatively wide format to a long format for the purpose of visualization or modeling. Wide is generally a more efficient storage structure while most statistical software expects a long format.
For example, the AirPassengers
data.frame contains
monthly air passengers across years.
AirPassengers
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
Formally, AirPassengers
is a time series object and thus
R treats it a bit differently. Let’s convert it to a data.frame. Can you
understand the piped code below?
months <- c("Jan","Feb","Mar","Apr","May","Jun",
"July","Aug","Sep","Oct","Nov","Dec")
d <- AirPassengers %>%
matrix(ncol=12) %>%
as.data.frame() %>%
setNames(months) %>%
mutate(year = 1949:1960) %>%
relocate(year)
Now we have a wide data frame, but we will typically want a long data.frame for visualization or analysis.
To convert to a longer data.frame where each row is a single
observation, we can use pivot_longer()
.
dl <- d %>%
pivot_longer(Jan:Dec, names_to = "month", values_to = "passengers") %>%
mutate(month = factor(month, levels = months)) # to order the months
table(dl$month)
##
## Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec
## 12 12 12 12 12 12 12 12 12 12 12 12
dl %>%
group_by(month) %>%
summarize(
n = n(),
mean = mean(passengers),
sd = sd(passengers)
)
## # A tibble: 12 × 4
## month n mean sd
## <fct> <int> <dbl> <dbl>
## 1 Jan 12 127. 13.7
## 2 Feb 12 140. 19.1
## 3 Mar 12 170. 18.4
## 4 Apr 12 197 23.0
## 5 May 12 225 28.5
## 6 Jun 12 239. 34.9
## 7 July 12 284 42.1
## 8 Aug 12 328. 47.9
## 9 Sep 12 368. 57.9
## 10 Oct 12 381 64.5
## 11 Nov 12 428. 69.8
## 12 Dec 12 476. 77.7
Plot yearly comparisons
ggplot(dl, aes(x = month, y = passengers,
group = year, color = year)) +
geom_line() +
scale_y_log10() +
theme_bw()
Fit a regression model
m <- lm(log(passengers) ~ year + month, data = dl)
summary(m)
##
## Call:
## lm(formula = log(passengers) ~ year + month, data = dl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24772 -0.09147 -0.02467 0.08133 0.28331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.055371 6.447154 -1.405 0.1625
## year 0.007107 0.003299 2.155 0.0330 *
## monthFeb 0.094646 0.055784 1.697 0.0921 .
## monthMar 0.295231 0.055784 5.292 4.94e-07 ***
## monthApr 0.441052 0.055784 7.906 9.62e-13 ***
## monthMay 0.572518 0.055784 10.263 < 2e-16 ***
## monthJun 0.630359 0.055784 11.300 < 2e-16 ***
## monthJuly 0.803090 0.055784 14.397 < 2e-16 ***
## monthAug 0.948250 0.055784 16.999 < 2e-16 ***
## monthSep 1.062194 0.055784 19.041 < 2e-16 ***
## monthOct 1.094404 0.055784 19.619 < 2e-16 ***
## monthNov 1.212189 0.055784 21.730 < 2e-16 ***
## monthDec 1.318037 0.055784 23.628 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1366 on 131 degrees of freedom
## Multiple R-squared: 0.9122, Adjusted R-squared: 0.9042
## F-statistic: 113.5 on 12 and 131 DF, p-value: < 2.2e-16
Less often, we want to convert back to a wide format.
dl %>% pivot_wider(id_cols = year,
names_from = "month",
values_from = "passengers")
## # A tibble: 12 × 13
## year Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1949 112 115 145 171 196 204 242 284 315 340 360 417
## 2 1950 118 126 150 180 196 188 233 277 301 318 342 391
## 3 1951 132 141 178 193 236 235 267 317 356 362 406 419
## 4 1952 129 135 163 181 235 227 269 313 348 348 396 461
## 5 1953 121 125 172 183 229 234 270 318 355 363 420 472
## 6 1954 135 149 178 218 243 264 315 374 422 435 472 535
## 7 1955 148 170 199 230 264 302 364 413 465 491 548 622
## 8 1956 148 170 199 242 272 293 347 405 467 505 559 606
## 9 1957 136 158 184 209 237 259 312 355 404 404 463 508
## 10 1958 119 133 162 191 211 229 274 306 347 359 407 461
## 11 1959 104 114 146 172 180 203 237 271 305 310 362 390
## 12 1960 118 140 166 194 201 229 278 306 336 337 405 432
Generally, I don’t suggest story data in binary formats, but these formats can be useful to store intermediate data. For example, if there is some important results from a statistical analysis that takes a long time to perform (I’m looking at you Bayesians) you might want to store the results in a binary format.
There are two functions that will save RData
files:
save()
and save.image()
. The latter will save
everything in the environment while the former will only save what it is
specifically told to save. Both of these are read using the
load()
function.
a <- 1
save(a, file = "a.RData")
An RDS file contains a single R object. These files can be written
using saveRDS()
and read using readRDS()
.
saveRDS(a, file = "a.RDS")
rm(a)
When you read this file, you need to save it into an R object.
b <- readRDS("a.RDS")
unlink("ToothGrowth.csv")
unlink("a.RData")
unlink("a.RDS")