There are a lot of different aspects to processing a raw dataset for analysis. It helps to tackle things in order and keep similar steps together.
Before starting, make a plan
Answer these questions:
- What are the input datasets?
- What are the output datasets?
- Will you need any intermediates?
Then describe each dataset. First, what is the unit of observation? One row will contain one observation. It could be one row per person, or per time point, or per group.
Second, what variables will be included? It doesn’t make sense to remove variables that were in the input dataset unless they get in the way of later reshaping or calculations. Keep as much as possible.
Make data accessible (ASAP)
1. Rename columns
Make all names short and informative. With tidyverse
, the convention is to use underscore names (col_one
).
2. Discard data that gets in the way
Use filter
here and try not to use it anywhere else. That way you can easily find all the filters in one place for later reference.
3. Reshape and join datasets
If there are multiple sources of data, and all of them have the same unit of observation, it’s best to merge them as soon as possible.
Often, the input dataset doesn’t have a row for each unit of observation. For example, you might need data with a row per participant but the input dataset has responses spread out across rows. This is the best time to pivot the data into the correct format.
Prep for analysis
4. Recode
The point of recoding is to put data into a format that can be used in calculations or analysis.
Some guidelines:
- The best format for data depends on its purpose (filter, math, viz, modeling)
- Format the same information consistently
- Combine columns when possible (e.g. combine dates and times)
- Group-wise information should be recycled through the group
The information should not change, so the variables themselves can be overwritten.
5. Perform calculations
Use the following framework:
suppressMessages(library(dplyr))
mtcars %>%
group_by(vs) %>%
mutate(newhp = mean(hp[gear == 3]))
Here, hp[gear == 3]
operates over observations of hp
when gear
is 3, and mutate
calculates each mean within the groups of vs
.
The simple combination of group_by
, mutate
, and [
can handle a surprisingly wide variety of tasks.
Write some tests to ensure the calculation is correct.
6. Rearrange
If there are columns that define the observation unit, put those first.
When groups of columns are often used together, keep them in order so they can be selected with select(first:last)
.
R tools
Examples with the nycflights13
data
library(nycflights13)
Testing
Many testing packages are available, including tinytest
, assertr
, and testthat
.
library(tinytest)
library(checkmate) # extends tinytest
mean_delays <- flights %>%
mutate(mean_del = mean(dep_delay, na.rm = T))
expect_double(mean_delays$mean_del, any.missing = F)
## ----- PASSED : <-->
## call| expect_double(mean_delays$mean_del, any.missing = F)
Viewing data
Summary stats:
print(skimr::skim(flights[1:3])) # Better summary (print not needed)
## ── Data Summary ────────────────────────
## Values
## Name flights[1:3]
## Number of rows 336776
## Number of columns 3
## _______________________
## Column type frequency:
## numeric 3
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
## 1 year 0 1 2013 0 2013 2013 2013 2013 2013
## 2 month 0 1 6.55 3.41 1 4 7 10 12
## 3 day 0 1 15.7 8.77 1 8 16 23 31
## hist
## 1 ▁▁▇▁▁
## 2 ▇▆▆▆▇
## 3 ▇▇▇▇▆
## $numeric
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
## 1 year 0 1 2013 0 2013 2013 2013 2013 2013
## 2 month 0 1 6.55 3.41 1 4 7 10 12
## 3 day 0 1 15.7 8.77 1 8 16 23 31
## # … with 1 more variable: hist <chr>
Head of all variables:
tibble::glimpse(flights) # Better head
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Getting help
Docs and examples:
?sum
example(sum)
##
## sum> ## Pass a vector to sum, and it will add the elements together.
## sum> sum(1:5)
## [1] 15
##
## sum> ## Pass several numbers to sum, and it also adds the elements.
## sum> sum(1, 2, 3, 4, 5)
## [1] 15
##
## sum> ## In fact, you can pass vectors into several arguments, and everything gets added.
## sum> sum(1:2, 3:5)
## [1] 15
##
## sum> ## If there are missing values, the sum is unknown, i.e., also missing, ....
## sum> sum(1:5, NA)
## [1] NA
##
## sum> ## ... unless we exclude missing values explicitly:
## sum> sum(1:5, NA, na.rm = TRUE)
## [1] 15