Recipe for R data pipelines

Nick Griffiths · Apr 09, 2020

There are a lot of different aspects to processing a raw dataset for analysis. It helps to tackle things in order and keep similar steps together.

Before starting, make a plan

Answer these questions:

  • What are the input datasets?
  • What are the output datasets?
  • Will you need any intermediates?

Then describe each dataset. First, what is the unit of observation? One row will contain one observation. It could be one row per person, or per time point, or per group.

Second, what variables will be included? It doesn’t make sense to remove variables that were in the input dataset unless they get in the way of later reshaping or calculations. Keep as much as possible.

Make data accessible (ASAP)

1. Rename columns

Make all names short and informative. With tidyverse, the convention is to use underscore names (col_one).

2. Discard data that gets in the way

Use filter here and try not to use it anywhere else. That way you can easily find all the filters in one place for later reference.

3. Reshape and join datasets

If there are multiple sources of data, and all of them have the same unit of observation, it’s best to merge them as soon as possible.

Often, the input dataset doesn’t have a row for each unit of observation. For example, you might need data with a row per participant but the input dataset has responses spread out across rows. This is the best time to pivot the data into the correct format.

Prep for analysis

4. Recode

The point of recoding is to put data into a format that can be used in calculations or analysis.

Some guidelines:

  • The best format for data depends on its purpose (filter, math, viz, modeling)
  • Format the same information consistently
  • Combine columns when possible (e.g. combine dates and times)
  • Group-wise information should be recycled through the group

The information should not change, so the variables themselves can be overwritten.

5. Perform calculations

Use the following framework:

suppressMessages(library(dplyr))

mtcars %>%
  group_by(vs) %>%
  mutate(newhp = mean(hp[gear == 3]))

Here, hp[gear == 3] operates over observations of hp when gear is 3, and mutate calculates each mean within the groups of vs.

The simple combination of group_by, mutate, and [ can handle a surprisingly wide variety of tasks.

Write some tests to ensure the calculation is correct.

6. Rearrange

If there are columns that define the observation unit, put those first.

When groups of columns are often used together, keep them in order so they can be selected with select(first:last).

R tools

Examples with the nycflights13 data

library(nycflights13)

Testing

Many testing packages are available, including tinytest, assertr, and testthat.

library(tinytest)
library(checkmate) # extends tinytest

mean_delays <- flights %>%
  mutate(mean_del = mean(dep_delay, na.rm = T))

expect_double(mean_delays$mean_del, any.missing = F)
## ----- PASSED      : <-->
##  call| expect_double(mean_delays$mean_del, any.missing = F)

Viewing data

Summary stats:

print(skimr::skim(flights[1:3])) # Better summary (print not needed)
## ── Data Summary ────────────────────────
##                            Values      
## Name                       flights[1:3]
## Number of rows             336776      
## Number of columns          3           
## _______________________                
## Column type frequency:                 
##   numeric                  3           
## ________________________               
## Group variables            None        
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate    mean   sd   p0  p25  p50  p75 p100
## 1 year                  0             1 2013    0    2013 2013 2013 2013 2013
## 2 month                 0             1    6.55 3.41    1    4    7   10   12
## 3 day                   0             1   15.7  8.77    1    8   16   23   31
##   hist 
## 1 ▁▁▇▁▁
## 2 ▇▆▆▆▇
## 3 ▇▇▇▇▆
## $numeric
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate    mean   sd   p0  p25  p50  p75 p100
## 1 year                  0             1 2013    0    2013 2013 2013 2013 2013
## 2 month                 0             1    6.55 3.41    1    4    7   10   12
## 3 day                   0             1   15.7  8.77    1    8   16   23   31
## # … with 1 more variable: hist <chr>

Head of all variables:

tibble::glimpse(flights) # Better head
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Getting help

Docs and examples:

?sum

example(sum)
## 
## sum> ## Pass a vector to sum, and it will add the elements together.
## sum> sum(1:5)
## [1] 15
## 
## sum> ## Pass several numbers to sum, and it also adds the elements.
## sum> sum(1, 2, 3, 4, 5)
## [1] 15
## 
## sum> ## In fact, you can pass vectors into several arguments, and everything gets added.
## sum> sum(1:2, 3:5)
## [1] 15
## 
## sum> ## If there are missing values, the sum is unknown, i.e., also missing, ....
## sum> sum(1:5, NA)
## [1] NA
## 
## sum> ## ... unless  we exclude missing values explicitly:
## sum> sum(1:5, NA, na.rm = TRUE)
## [1] 15