Verifying data analysis pipelines

Nick Griffiths · Mar 07, 2020

Data analysis pipelines can feel like black boxes: the code might look right, and the results might be reasonable, but it’s not always easy to tell whether they behave as expected. Even small datasets are too big to print out and check directly. The only way to be sure that a pipeline worked is to carefully test it.

Here’s a simple example using the mtcars dataset. We want to calculate the mean of hp within each level (group) of gear.

hp_by_gears <-
  mtcars %>%
  group_by(gear) %>%
  mutate(mean_hp = mean(hp)) %>%
  ungroup()

How do we know we have grouped and calculated the mean correctly?

The typical way to verify is to check one of the groups manually – to spot check.

hp_by_gears %>%
  filter(gear == 5) %>%
  select(gear, hp, mean_hp)
## # A tibble: 5 × 3
##    gear    hp mean_hp
##   <dbl> <dbl>   <dbl>
## 1     5    91   195.6
## 2     5   113   195.6
## 3     5   264   195.6
## 4     5   175   195.6
## 5     5   335   195.6

mean(c(91, 113, 264, 175, 335))
## [1] 195.6

It looks good, and doesn’t identify any problems. But this does a poor job of arguing that it works in every group. While we no longer blindly trust the code, we are still making lots of assumptions:

What we are really looking for is a set of checks that prove the pipeline worked with very minimal assumptions. They should check the whole dataset, not just a few groups.

First, every observation now has a value of mean_hp corresponding to the mean within that group. If we add all these up, we should get the same value as the sum of of the raw values:

sum(hp_by_gears$hp)
## [1] 4694
sum(hp_by_gears$mean_hp)
## [1] 4694

If we didn’t, one of the means must be wrong.

Second, the variable mean_hp should not vary within a group:

hp_by_gears %>%
  group_by(gear) %>%
  summarize(n(), n_distinct(mean_hp))
## # A tibble: 3 × 3
##    gear `n()` `n_distinct(mean_hp)`
##   <dbl> <int>                 <int>
## 1     3    15                     1
## 2     4    12                     1
## 3     5     5                     1

If there were more than one value per group, it would suggest that the grouping failed or the mean didn’t take the right inputs.

If we know that the mean of one group is correct, the sum of the original variable matches the sum of the means, and there is exactly one mean per group, then there is almost no way the process could have failed. That’s the level of confidence we should have in every pipeline.