Pivoting data is just rearranging groups

Nick Griffiths · Jun 22, 2020

After a while of using the pivot_wider and pivot_longer tidyverse functions, I still can never remember how they work. The argument names never seem to stick.

I’ve found it’s easier to think about in terms of grouping. Pivoting is just an operation that can take groups of data and sort them into columns or combine them into rows. The key to understanding pivoting is to visualize the groups that are being rearranged.

For example, take the tidyr::fish_encounters dataset. It is important to recognize that this dataset only contains a single column of actual data, which is the seen column. The fish and station columns can be thought of as grouping variables for seen.

Imagine annotating each value of seen with the corresponding value of station and fish, so the fish and station columns are embedded in a single column. Then you can move around these seen values however you want with no loss of information. Each annotation (fish or station) can define either a group of rows or a set of columns.

To pivot wider, you make groups by fish and station and use fish to group rows, while station sorts the data into columns.

The corresponding R code for this is below:

library(tidyr)
wide_fishes <- pivot_wider(fish_encounters, id_cols = fish, names_from = "station", values_from = seen)
wide_fishes
## # A tibble: 19 × 12
##    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
##    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
##  1 4842        1     1      1     1       1     1     1     1     1     1     1
##  2 4843        1     1      1     1       1     1     1     1     1     1     1
##  3 4844        1     1      1     1       1     1     1     1     1     1     1
##  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
##  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
##  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA
##  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
##  8 4850        1     1     NA     1       1     1     1    NA    NA    NA    NA
##  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 11 4855        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 12 4857        1     1      1     1       1     1     1     1     1    NA    NA
## 13 4858        1     1      1     1       1     1     1     1     1     1     1
## 14 4859        1     1      1     1       1    NA    NA    NA    NA    NA    NA
## 15 4861        1     1      1     1       1     1     1     1     1     1     1
## 16 4862        1     1      1     1       1     1     1     1     1    NA    NA
## 17 4863        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 18 4864        1     1     NA    NA      NA    NA    NA    NA    NA    NA    NA
## 19 4865        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA

So, to translate, values_from selects the data, id_cols defines the grouping you want to keep as row groups, and names_from is the grouping you want to translate to columns. Every column should fall in one of these categories.

Similarly, we can pivot back to a longer format by doing the reverse. We have a grouping variable fish and a data column for each station. We make groups of data using fish and station, then stack all the data back into a single column.

pivot_longer(wide_fishes, cols = -fish, names_to = "station", values_to = "seen")
## # A tibble: 209 × 3
##    fish  station  seen
##    <fct> <chr>   <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 199 more rows

The pivot_longer function treats all selected columns as data columns, and all non-selected columns as grouping variables. It then combines all the data columns into a single column. We select our data with cols. values_to names the new combined data column, and names_to names the new grouping variable.