https://github.com/gadenbuie/tidyexplain/blob/main/images/tidyr-pivoting.gif
https://github.com/gadenbuie/tidyexplain/blob/main/images/tidyr-pivoting.gif
Data is stored differently in the tibble.
Wide: has many columns
# A tibble: 1 × 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <dbl> <dbl> <dbl> 1 Alabama 0.516 0.514 0.511
Long: column names become data
# A tibble: 3 × 3 State name value <chr> <chr> <dbl> 1 Alabama June_vacc_rate 0.516 2 Alabama May_vacc_rate 0.514 3 Alabama April_vacc_rate 0.511
Wide: multiple columns per individual, values spread across multiple columns
# A tibble: 2 × 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <dbl> <dbl> <dbl> 1 Alabama 0.516 0.514 0.511 2 Alaska 0.627 0.626 0.623
Long: multiple rows per observation, a single column contains the values
# A tibble: 6 × 3 State name value <chr> <chr> <dbl> 1 Alabama June_vacc_rate 0.516 2 Alabama May_vacc_rate 0.514 3 Alabama April_vacc_rate 0.511 4 Alaska June_vacc_rate 0.627 5 Alaska May_vacc_rate 0.626 6 Alaska April_vacc_rate 0.623
Data is wide or long with respect to certain variables.
Wide: Easier for humans to read
# A tibble: 2 × 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <dbl> <dbl> <dbl> 1 Alabama 0.516 0.514 0.511 2 Alaska 0.627 0.626 0.623
Long: Easier for R to make plots & do analysis
# A tibble: 6 × 3 State name value <chr> <chr> <dbl> 1 Alabama June_vacc_rate 0.516 2 Alabama May_vacc_rate 0.514 3 Alabama April_vacc_rate 0.511 4 Alaska June_vacc_rate 0.627 5 Alaska May_vacc_rate 0.626 6 Alaska April_vacc_rate 0.623
tidyr
packagetidyr
allows you to “tidy” your data. We will be talking about:
pivot_longer
- make multiple columns into variables, (wide to long)pivot_wider
- make a variable into multiple columns, (long to wide)separate
- string into multiple columnsThe reshape
command exists. Its arguments are considered more confusing, so we don’t recommend it.
You might see old functions gather
and spread
when googling. These are older iterations of pivot_longer
and pivot_wider
, respectively.
pivot_longer
…pivot_longer()
- puts column data into rows (tidyr
package)
{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot})
wide_data
# A tibble: 1 × 3 June_vacc_rate May_vacc_rate April_vacc_rate <dbl> <dbl> <dbl> 1 0.516 0.514 0.511
long_data <- wide_data %>% pivot_longer(cols = everything()) long_data
# A tibble: 3 × 2 name value <chr> <dbl> 1 June_vacc_rate 0.516 2 May_vacc_rate 0.514 3 April_vacc_rate 0.511
pivot_longer()
- puts column data into rows (tidyr
package)
names_to =
gives a new name to the pivoted columnsvalues_to =
gives a new name to the values that used to be in those columns{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot}, names_to = {New column name: contains old column names}, values_to = {New column name: contains cell values})
wide_data
# A tibble: 1 × 3 June_vacc_rate May_vacc_rate April_vacc_rate <dbl> <dbl> <dbl> 1 0.516 0.514 0.511
long_data <- wide_data %>% pivot_longer(cols = everything(), names_to = "Month", values_to = "Rate") long_data
# A tibble: 3 × 2 Month Rate <chr> <dbl> 1 June_vacc_rate 0.516 2 May_vacc_rate 0.514 3 April_vacc_rate 0.511
Newly created column names are enclosed in quotation marks.
https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv
circ <- read_csv("https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv") head(circ, 5)
# A tibble: 5 × 15 day date orangeBoardings orangeAlightings orangeAverage purpleBoardings purpleAlightings purpleAverage <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Monday 01/1… 877 1027 952 NA NA NA 2 Tuesday 01/1… 777 815 796 NA NA NA 3 Wednesd… 01/1… 1203 1220 1212. NA NA NA 4 Thursday 01/1… 1194 1233 1214. NA NA NA 5 Friday 01/1… 1645 1643 1644 NA NA NA # ℹ 7 more variables: greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>, bannerBoardings <dbl>, # bannerAlightings <dbl>, bannerAverage <dbl>, daily <dbl>
long <- circ %>% pivot_longer(starts_with(c("orange","purple","green","banner"))) long
# A tibble: 13,752 × 5 day date daily name value <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # ℹ 13,742 more rows
There are many ways to select the columns we want. Use ?tidyr_tidy_select
to look at more column selection options.
long <- circ %>% pivot_longer( !c(day, date, daily)) long
# A tibble: 13,752 × 5 day date daily name value <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # ℹ 13,742 more rows
We will use str_replace
from the stringr
package to put _
in the names
long <- long %>% mutate( name = str_replace(name, "Board", "_Board"), name = str_replace(name, "Alight", "_Alight"), name = str_replace(name, "Average", "_Average") ) long
# A tibble: 13,752 × 5 day date daily name value <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orange_Boardings 877 2 Monday 01/11/2010 952 orange_Alightings 1027 3 Monday 01/11/2010 952 orange_Average 952 4 Monday 01/11/2010 952 purple_Boardings NA 5 Monday 01/11/2010 952 purple_Alightings NA 6 Monday 01/11/2010 952 purple_Average NA 7 Monday 01/11/2010 952 green_Boardings NA 8 Monday 01/11/2010 952 green_Alightings NA 9 Monday 01/11/2010 952 green_Average NA 10 Monday 01/11/2010 952 banner_Boardings NA # ℹ 13,742 more rows
Now each var
is Boardings, Averages, or Alightings. We use “into =
” to name the new columns and “sep =
” to show where the separation should happen.
long <- long %>% separate(name, into = c("line", "type"), sep = "_") long
# A tibble: 13,752 × 6 day date daily line type value <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # ℹ 13,742 more rows
pivot_wider
…pivot_wider()
- spreads row data into columns (tidyr
package)
names_from =
the old column whose contents will be spread into multiple new column names.values_from =
the old column whose contents will fill in the values of those new columns.{wide_data} <- {long_data} %>% pivot_wider(names_from = {Old column name: contains new column names}, values_from = {Old column name: contains new cell values})
long_data
# A tibble: 3 × 2 Month Rate <chr> <dbl> 1 June_vacc_rate 0.516 2 May_vacc_rate 0.514 3 April_vacc_rate 0.511
wide_data <- long_data %>% pivot_wider(names_from = "Month", values_from = "Rate") wide_data
# A tibble: 1 × 3 June_vacc_rate May_vacc_rate April_vacc_rate <dbl> <dbl> <dbl> 1 0.516 0.514 0.511
long
# A tibble: 13,752 × 6 day date daily line type value <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # ℹ 13,742 more rows
wide <- long %>% pivot_wider(names_from = "type", values_from = "value") wide
# A tibble: 4,584 × 7 day date daily line Boardings Alightings Average <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 Monday 01/11/2010 952 orange 877 1027 952 2 Monday 01/11/2010 952 purple NA NA NA 3 Monday 01/11/2010 952 green NA NA NA 4 Monday 01/11/2010 952 banner NA NA NA 5 Tuesday 01/12/2010 796 orange 777 815 796 6 Tuesday 01/12/2010 796 purple NA NA NA 7 Tuesday 01/12/2010 796 green NA NA NA 8 Tuesday 01/12/2010 796 banner NA NA NA 9 Wednesday 01/13/2010 1212. orange 1203 1220 1212. 10 Wednesday 01/13/2010 1212. purple NA NA NA # ℹ 4,574 more rows
the datasets::airquality
data shows various air quality metrics measured in New York in 1973.
air <- datasets::airquality %>% select(Temp, Month, Day) air
Temp Month Day 1 67 5 1 2 72 5 2 3 74 5 3 4 62 5 4 5 56 5 5 6 66 5 6 7 65 5 7 8 59 5 8 9 61 5 9 10 69 5 10 11 74 5 11 12 69 5 12 13 66 5 13 14 68 5 14 15 58 5 15 16 64 5 16 17 66 5 17 18 57 5 18 19 68 5 19 20 62 5 20 21 59 5 21 22 73 5 22 23 61 5 23 24 61 5 24 25 57 5 25 26 58 5 26 27 57 5 27 28 67 5 28 29 81 5 29 30 79 5 30 31 76 5 31 32 78 6 1 33 74 6 2 34 67 6 3 35 84 6 4 36 85 6 5 37 79 6 6 38 82 6 7 39 87 6 8 40 90 6 9 41 87 6 10 42 93 6 11 43 92 6 12 44 82 6 13 45 80 6 14 46 79 6 15 47 77 6 16 48 72 6 17 49 65 6 18 50 73 6 19 51 76 6 20 52 77 6 21 53 76 6 22 54 76 6 23 55 76 6 24 56 75 6 25 57 78 6 26 58 73 6 27 59 80 6 28 60 77 6 29 61 83 6 30 62 84 7 1 63 85 7 2 64 81 7 3 65 84 7 4 66 83 7 5 67 83 7 6 68 88 7 7 69 92 7 8 70 92 7 9 71 89 7 10 72 82 7 11 73 73 7 12 74 81 7 13 75 91 7 14 76 80 7 15 77 81 7 16 78 82 7 17 79 84 7 18 80 87 7 19 81 85 7 20 82 74 7 21 83 81 7 22 84 82 7 23 85 86 7 24 86 85 7 25 87 82 7 26 88 86 7 27 89 88 7 28 90 86 7 29 91 83 7 30 92 81 7 31 93 81 8 1 94 81 8 2 95 82 8 3 96 86 8 4 97 85 8 5 98 87 8 6 99 89 8 7 100 90 8 8 101 90 8 9 102 92 8 10 103 86 8 11 104 86 8 12 105 82 8 13 106 80 8 14 107 79 8 15 108 77 8 16 109 79 8 17 110 76 8 18 111 78 8 19 112 78 8 20 113 77 8 21 114 72 8 22 115 75 8 23 116 79 8 24 117 81 8 25 118 86 8 26 119 88 8 27 120 97 8 28 121 94 8 29 122 96 8 30 123 94 8 31 124 91 9 1 125 92 9 2 126 93 9 3 127 93 9 4 128 87 9 5 129 84 9 6 130 80 9 7 131 78 9 8 132 75 9 9 133 73 9 10 134 81 9 11 135 76 9 12 136 77 9 13 137 71 9 14 138 71 9 15 139 78 9 16 140 67 9 17 141 76 9 18 142 68 9 19 143 82 9 20 144 64 9 21 145 71 9 22 146 81 9 23 147 69 9 24 148 63 9 25 149 70 9 26 150 77 9 27 151 75 9 28 152 76 9 29 153 68 9 30
Let’s pivot Month
wider: but it might be helpful to add “Month” to the new column name so it isn’t just numbers.
air %>% pivot_wider(names_from = "Month", values_from = "Temp")
# A tibble: 31 × 6 Day `5` `6` `7` `8` `9` <int> <int> <int> <int> <int> <int> 1 1 67 78 84 81 91 2 2 72 74 85 81 92 3 3 74 67 81 82 93 4 4 62 84 84 86 93 5 5 56 85 83 85 87 6 6 66 79 83 87 84 7 7 65 82 88 89 80 8 8 59 87 92 90 78 9 9 61 90 92 90 75 10 10 69 87 89 92 73 # ℹ 21 more rows
Much better!
air %>% pivot_wider(names_from = "Month", values_from = "Temp", names_prefix = "Month_")
# A tibble: 31 × 6 Day Month_5 Month_6 Month_7 Month_8 Month_9 <int> <int> <int> <int> <int> <int> 1 1 67 78 84 81 91 2 2 72 74 85 81 92 3 3 74 67 81 82 93 4 4 62 84 84 86 93 5 5 56 85 83 85 87 6 6 66 79 83 87 84 7 7 65 82 88 89 80 8 8 59 87 92 90 78 9 9 61 90 92 90 75 10 10 69 87 89 92 73 # ℹ 21 more rows
tidyr
package helps us convert between wide and long datapivot_longer()
goes from wide -> long
names_to =
and values_to =
for custom namingpivot_wider()
goes from long -> wide
names_from =
and values_from =