Reading a directory of data with drake

drake is an R package that provides make functionality entirely within R. That is, it will run a set of commands in a hierarchical (or tree) structure. Then, when when pieces of that structure changes, drake will only re-run the pieces that need to be re-run.

I have been using these two functions to read in a directory worth of data. Everytime I run these functions, I need to re-read the entire directory. It would be more convenient if I could use drake, or something similar, so that I only need to reread the files that have changed.

So here is a script that will perform that process

options(width = 100)
library("drake")

dir.create("data")

## Warning in dir.create("data"): 'data' already exists

write.csv(data.frame(g = 1, x = 1), file = "data/g1.csv")
write.csv(data.frame(g = 2, x = 2), file = "data/g2.csv")
files = list.files("data", "*.csv", full.names = TRUE)

add2 = function(d) { # example function to apply to each individual data.frame
  d$x = d$x+2
  return(d)
}

plan = drake_plan( # This is where you define the set of commands to run
  data  = target(
    read.csv(file_in(file)),
    transform = map(file = !!files)
  ),
  add2  = target(
    add2(data),
    transform = map(data)
  ),
  all = target(
    dplyr::bind_rows(add2),
    transform = combine(add2)
  ),
  out = saveRDS(all, file = file_out("all.RDS"))
)

Let’s take a look at the plan

plan # Take a look at the targets and commands that will be run

## # A tibble: 6 x 2
##   target                command                                                       
##   <chr>                 <expr>                                                        
## 1 add2_data_data.g1.csv add2(data_data.g1.csv)                                        
## 2 add2_data_data.g2.csv add2(data_data.g2.csv)                                        
## 3 all                   dplyr::bind_rows(add2_data_data.g1.csv, add2_data_data.g2.csv)
## 4 data_data.g1.csv      read.csv(file_in("data/g1.csv"))                              
## 5 data_data.g2.csv      read.csv(file_in("data/g2.csv"))                              
## 6 out                   saveRDS(all, file = file_out("all.RDS"))

Now to actually run the plan use

make(plan)

## ▶ target data_data.g1.csv

## ▶ target add2_data_data.g1.csv

## ▶ target all

## ▶ target out

If you try to run the plan again, drake tells you

make(plan)

## ✓ All targets are already up to date.

Now if a file changes, you can just rerun the plan.

write.csv(data.frame(g = 1, x = 11), file = "data/g1.csv")
make(plan)

## ▶ target data_data.g1.csv

## ▶ target add2_data_data.g1.csv

## ▶ target all

## ▶ target out

blog comments powered by Disqus

Published

31 May 2020

Reading a directory of data with drake

Published

Tags