Data Reshaping

Reshaping: wide vs. long data

https://github.com/gadenbuie/tidyexplain/blob/main/images/tidyr-pivoting.gif

A gif visualization of data going from wide to long form.

What is wide/long data?

Data is stored differently in the tibble.

Wide: has many columns

# A tibble: 1 × 4
  State   June_vacc_rate May_vacc_rate April_vacc_rate
  <chr>            <dbl>         <dbl>           <dbl>
1 Alabama          0.516         0.514           0.511

Long: column names become data

# A tibble: 3 × 3
  State   name            value
  <chr>   <chr>           <dbl>
1 Alabama June_vacc_rate  0.516
2 Alabama May_vacc_rate   0.514
3 Alabama April_vacc_rate 0.511

What is wide/long data?

Wide: multiple columns per individual, values spread across multiple columns

# A tibble: 2 × 4
  State   June_vacc_rate May_vacc_rate April_vacc_rate
  <chr>            <dbl>         <dbl>           <dbl>
1 Alabama          0.516         0.514           0.511
2 Alaska           0.627         0.626           0.623

Long: multiple rows per observation, a single column contains the values

# A tibble: 6 × 3
  State   name            value
  <chr>   <chr>           <dbl>
1 Alabama June_vacc_rate  0.516
2 Alabama May_vacc_rate   0.514
3 Alabama April_vacc_rate 0.511
4 Alaska  June_vacc_rate  0.627
5 Alaska  May_vacc_rate   0.626
6 Alaska  April_vacc_rate 0.623

What is wide/long data?

Data is wide or long with respect to certain variables.

Wide versus long data rearanges the position of column names and row content.

Why do we need to switch between wide/long data?

Wide: Easier for humans to read

# A tibble: 2 × 4
  State   June_vacc_rate May_vacc_rate April_vacc_rate
  <chr>            <dbl>         <dbl>           <dbl>
1 Alabama          0.516         0.514           0.511
2 Alaska           0.627         0.626           0.623

Long: Easier for R to make plots & do analysis

# A tibble: 6 × 3
  State   name            value
  <chr>   <chr>           <dbl>
1 Alabama June_vacc_rate  0.516
2 Alabama May_vacc_rate   0.514
3 Alabama April_vacc_rate 0.511
4 Alaska  June_vacc_rate  0.627
5 Alaska  May_vacc_rate   0.626
6 Alaska  April_vacc_rate 0.623

Pivoting using `tidyr` package

tidyr allows you to “tidy” your data. We will be talking about:

pivot_longer - make multiple columns into variables, (wide to long)
pivot_wider - make a variable into multiple columns, (long to wide)
separate - string into multiple columns

The reshape command exists. Its arguments are considered more confusing, so we don’t recommend it.

You might see old functions gather and spread when googling. These are older iterations of pivot_longer and pivot_wider, respectively.

`pivot_longer`…

Reshaping data from wide to long

pivot_longer() - puts column data into rows (tidyr package)

First describe which columns we want to “pivot_longer”

{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot})

Reshaping data from wide to long

wide_data

# A tibble: 1 × 3
  June_vacc_rate May_vacc_rate April_vacc_rate
           <dbl>         <dbl>           <dbl>
1          0.516         0.514           0.511

long_data <- wide_data %>% pivot_longer(cols = everything())
long_data

# A tibble: 3 × 2
  name            value
  <chr>           <dbl>
1 June_vacc_rate  0.516
2 May_vacc_rate   0.514
3 April_vacc_rate 0.511

Reshaping data from wide to long

pivot_longer() - puts column data into rows (tidyr package)

First describe which columns we want to “pivot_longer”
names_to = gives a new name to the pivoted columns
values_to = gives a new name to the values that used to be in those columns

{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot},
                                        names_to = {New column name: contains old column names},
                                        values_to = {New column name: contains cell values})

Reshaping data from wide to long

wide_data

# A tibble: 1 × 3
  June_vacc_rate May_vacc_rate April_vacc_rate
           <dbl>         <dbl>           <dbl>
1          0.516         0.514           0.511

long_data <- wide_data %>% pivot_longer(cols = everything(),
                                        names_to = "Month",
                                        values_to = "Rate")
long_data

# A tibble: 3 × 2
  Month            Rate
  <chr>           <dbl>
1 June_vacc_rate  0.516
2 May_vacc_rate   0.514
3 April_vacc_rate 0.511

Newly created column names are enclosed in quotation marks.

Data used: Charm City Circulator

https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv

circ <- 
  read_csv("https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv")
head(circ, 5)

# A tibble: 5 × 15
  day      date  orangeBoardings orangeAlightings orangeAverage purpleBoardings purpleAlightings purpleAverage
  <chr>    <chr>           <dbl>            <dbl>         <dbl>           <dbl>            <dbl>         <dbl>
1 Monday   01/1…             877             1027          952               NA               NA            NA
2 Tuesday  01/1…             777              815          796               NA               NA            NA
3 Wednesd… 01/1…            1203             1220         1212.              NA               NA            NA
4 Thursday 01/1…            1194             1233         1214.              NA               NA            NA
5 Friday   01/1…            1645             1643         1644               NA               NA            NA
# ℹ 7 more variables: greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>, bannerBoardings <dbl>,
#   bannerAlightings <dbl>, bannerAverage <dbl>, daily <dbl>

Reshaping data from wide to long

long <- circ %>% 
  pivot_longer(starts_with(c("orange","purple","green","banner")))
long

# A tibble: 13,752 × 5
   day    date       daily name             value
   <chr>  <chr>      <dbl> <chr>            <dbl>
 1 Monday 01/11/2010   952 orangeBoardings    877
 2 Monday 01/11/2010   952 orangeAlightings  1027
 3 Monday 01/11/2010   952 orangeAverage      952
 4 Monday 01/11/2010   952 purpleBoardings     NA
 5 Monday 01/11/2010   952 purpleAlightings    NA
 6 Monday 01/11/2010   952 purpleAverage       NA
 7 Monday 01/11/2010   952 greenBoardings      NA
 8 Monday 01/11/2010   952 greenAlightings     NA
 9 Monday 01/11/2010   952 greenAverage        NA
10 Monday 01/11/2010   952 bannerBoardings     NA
# ℹ 13,742 more rows

Reshaping data from wide to long

There are many ways to select the columns we want. Use ?tidyr_tidy_select to look at more column selection options.

long <- circ %>% 
  pivot_longer( !c(day, date, daily))
long

# A tibble: 13,752 × 5
   day    date       daily name             value
   <chr>  <chr>      <dbl> <chr>            <dbl>
 1 Monday 01/11/2010   952 orangeBoardings    877
 2 Monday 01/11/2010   952 orangeAlightings  1027
 3 Monday 01/11/2010   952 orangeAverage      952
 4 Monday 01/11/2010   952 purpleBoardings     NA
 5 Monday 01/11/2010   952 purpleAlightings    NA
 6 Monday 01/11/2010   952 purpleAverage       NA
 7 Monday 01/11/2010   952 greenBoardings      NA
 8 Monday 01/11/2010   952 greenAlightings     NA
 9 Monday 01/11/2010   952 greenAverage        NA
10 Monday 01/11/2010   952 bannerBoardings     NA
# ℹ 13,742 more rows

Cleaning up long data

We will use str_replace from the stringr package to put _ in the names

long <- long %>% mutate(
  name = str_replace(name, "Board", "_Board"),
  name = str_replace(name, "Alight", "_Alight"),
  name = str_replace(name, "Average", "_Average") 
)
long

# A tibble: 13,752 × 5
   day    date       daily name              value
   <chr>  <chr>      <dbl> <chr>             <dbl>
 1 Monday 01/11/2010   952 orange_Boardings    877
 2 Monday 01/11/2010   952 orange_Alightings  1027
 3 Monday 01/11/2010   952 orange_Average      952
 4 Monday 01/11/2010   952 purple_Boardings     NA
 5 Monday 01/11/2010   952 purple_Alightings    NA
 6 Monday 01/11/2010   952 purple_Average       NA
 7 Monday 01/11/2010   952 green_Boardings      NA
 8 Monday 01/11/2010   952 green_Alightings     NA
 9 Monday 01/11/2010   952 green_Average        NA
10 Monday 01/11/2010   952 banner_Boardings     NA
# ℹ 13,742 more rows

Cleaning up long data

Now each var is Boardings, Averages, or Alightings. We use “into =” to name the new columns and “sep =” to show where the separation should happen.

long <- long %>% 
  separate(name, into = c("line", "type"), sep = "_")
long

# A tibble: 13,752 × 6
   day    date       daily line   type       value
   <chr>  <chr>      <dbl> <chr>  <chr>      <dbl>
 1 Monday 01/11/2010   952 orange Boardings    877
 2 Monday 01/11/2010   952 orange Alightings  1027
 3 Monday 01/11/2010   952 orange Average      952
 4 Monday 01/11/2010   952 purple Boardings     NA
 5 Monday 01/11/2010   952 purple Alightings    NA
 6 Monday 01/11/2010   952 purple Average       NA
 7 Monday 01/11/2010   952 green  Boardings     NA
 8 Monday 01/11/2010   952 green  Alightings    NA
 9 Monday 01/11/2010   952 green  Average       NA
10 Monday 01/11/2010   952 banner Boardings     NA
# ℹ 13,742 more rows

`pivot_wider`…

Reshaping data from long to wide

pivot_wider() - spreads row data into columns (tidyr package)

names_from = the old column whose contents will be spread into multiple new column names.
values_from = the old column whose contents will fill in the values of those new columns.

{wide_data} <- {long_data} %>% 
  pivot_wider(names_from = {Old column name: contains new column names},
              values_from = {Old column name: contains new cell values})

Reshaping data from long to wide

long_data

# A tibble: 3 × 2
  Month            Rate
  <chr>           <dbl>
1 June_vacc_rate  0.516
2 May_vacc_rate   0.514
3 April_vacc_rate 0.511

wide_data <- long_data %>% pivot_wider(names_from = "Month", 
                                       values_from = "Rate") 
wide_data

# A tibble: 1 × 3
  June_vacc_rate May_vacc_rate April_vacc_rate
           <dbl>         <dbl>           <dbl>
1          0.516         0.514           0.511

Reshaping Charm City Circulator

long

# A tibble: 13,752 × 6
   day    date       daily line   type       value
   <chr>  <chr>      <dbl> <chr>  <chr>      <dbl>
 1 Monday 01/11/2010   952 orange Boardings    877
 2 Monday 01/11/2010   952 orange Alightings  1027
 3 Monday 01/11/2010   952 orange Average      952
 4 Monday 01/11/2010   952 purple Boardings     NA
 5 Monday 01/11/2010   952 purple Alightings    NA
 6 Monday 01/11/2010   952 purple Average       NA
 7 Monday 01/11/2010   952 green  Boardings     NA
 8 Monday 01/11/2010   952 green  Alightings    NA
 9 Monday 01/11/2010   952 green  Average       NA
10 Monday 01/11/2010   952 banner Boardings     NA
# ℹ 13,742 more rows

Reshaping Charm City Circulator

wide <- long %>% pivot_wider(names_from = "type", 
                             values_from = "value") 
wide

# A tibble: 4,584 × 7
   day       date       daily line   Boardings Alightings Average
   <chr>     <chr>      <dbl> <chr>      <dbl>      <dbl>   <dbl>
 1 Monday    01/11/2010  952  orange       877       1027    952 
 2 Monday    01/11/2010  952  purple        NA         NA     NA 
 3 Monday    01/11/2010  952  green         NA         NA     NA 
 4 Monday    01/11/2010  952  banner        NA         NA     NA 
 5 Tuesday   01/12/2010  796  orange       777        815    796 
 6 Tuesday   01/12/2010  796  purple        NA         NA     NA 
 7 Tuesday   01/12/2010  796  green         NA         NA     NA 
 8 Tuesday   01/12/2010  796  banner        NA         NA     NA 
 9 Wednesday 01/13/2010 1212. orange      1203       1220   1212.
10 Wednesday 01/13/2010 1212. purple        NA         NA     NA 
# ℹ 4,574 more rows

Adding prefixes

Prefixes when pivoting

the datasets::airquality data shows various air quality metrics measured in New York in 1973.

air <- datasets::airquality %>% select(Temp, Month, Day)
air

    Temp Month Day
1     67     5   1
2     72     5   2
3     74     5   3
4     62     5   4
5     56     5   5
6     66     5   6
7     65     5   7
8     59     5   8
9     61     5   9
10    69     5  10
11    74     5  11
12    69     5  12
13    66     5  13
14    68     5  14
15    58     5  15
16    64     5  16
17    66     5  17
18    57     5  18
19    68     5  19
20    62     5  20
21    59     5  21
22    73     5  22
23    61     5  23
24    61     5  24
25    57     5  25
26    58     5  26
27    57     5  27
28    67     5  28
29    81     5  29
30    79     5  30
31    76     5  31
32    78     6   1
33    74     6   2
34    67     6   3
35    84     6   4
36    85     6   5
37    79     6   6
38    82     6   7
39    87     6   8
40    90     6   9
41    87     6  10
42    93     6  11
43    92     6  12
44    82     6  13
45    80     6  14
46    79     6  15
47    77     6  16
48    72     6  17
49    65     6  18
50    73     6  19
51    76     6  20
52    77     6  21
53    76     6  22
54    76     6  23
55    76     6  24
56    75     6  25
57    78     6  26
58    73     6  27
59    80     6  28
60    77     6  29
61    83     6  30
62    84     7   1
63    85     7   2
64    81     7   3
65    84     7   4
66    83     7   5
67    83     7   6
68    88     7   7
69    92     7   8
70    92     7   9
71    89     7  10
72    82     7  11
73    73     7  12
74    81     7  13
75    91     7  14
76    80     7  15
77    81     7  16
78    82     7  17
79    84     7  18
80    87     7  19
81    85     7  20
82    74     7  21
83    81     7  22
84    82     7  23
85    86     7  24
86    85     7  25
87    82     7  26
88    86     7  27
89    88     7  28
90    86     7  29
91    83     7  30
92    81     7  31
93    81     8   1
94    81     8   2
95    82     8   3
96    86     8   4
97    85     8   5
98    87     8   6
99    89     8   7
100   90     8   8
101   90     8   9
102   92     8  10
103   86     8  11
104   86     8  12
105   82     8  13
106   80     8  14
107   79     8  15
108   77     8  16
109   79     8  17
110   76     8  18
111   78     8  19
112   78     8  20
113   77     8  21
114   72     8  22
115   75     8  23
116   79     8  24
117   81     8  25
118   86     8  26
119   88     8  27
120   97     8  28
121   94     8  29
122   96     8  30
123   94     8  31
124   91     9   1
125   92     9   2
126   93     9   3
127   93     9   4
128   87     9   5
129   84     9   6
130   80     9   7
131   78     9   8
132   75     9   9
133   73     9  10
134   81     9  11
135   76     9  12
136   77     9  13
137   71     9  14
138   71     9  15
139   78     9  16
140   67     9  17
141   76     9  18
142   68     9  19
143   82     9  20
144   64     9  21
145   71     9  22
146   81     9  23
147   69     9  24
148   63     9  25
149   70     9  26
150   77     9  27
151   75     9  28
152   76     9  29
153   68     9  30

Prefixes when pivoting

Let’s pivot Month wider: but it might be helpful to add “Month” to the new column name so it isn’t just numbers.

air %>% pivot_wider(names_from = "Month", 
                    values_from = "Temp")

# A tibble: 31 × 6
     Day   `5`   `6`   `7`   `8`   `9`
   <int> <int> <int> <int> <int> <int>
 1     1    67    78    84    81    91
 2     2    72    74    85    81    92
 3     3    74    67    81    82    93
 4     4    62    84    84    86    93
 5     5    56    85    83    85    87
 6     6    66    79    83    87    84
 7     7    65    82    88    89    80
 8     8    59    87    92    90    78
 9     9    61    90    92    90    75
10    10    69    87    89    92    73
# ℹ 21 more rows

Prefixes when pivoting

Much better!

air %>% pivot_wider(names_from = "Month", 
                    values_from = "Temp",
                    names_prefix = "Month_")

# A tibble: 31 × 6
     Day Month_5 Month_6 Month_7 Month_8 Month_9
   <int>   <int>   <int>   <int>   <int>   <int>
 1     1      67      78      84      81      91
 2     2      72      74      85      81      92
 3     3      74      67      81      82      93
 4     4      62      84      84      86      93
 5     5      56      85      83      85      87
 6     6      66      79      83      87      84
 7     7      65      82      88      89      80
 8     8      59      87      92      90      78
 9     9      61      90      92      90      75
10    10      69      87      89      92      73
# ℹ 21 more rows

Summary

tidyr package helps us convert between wide and long data
pivot_longer() goes from wide -> long
- Specify columns you want to pivot
- Specify names_to = and values_to = for custom naming
pivot_wider() goes from long -> wide
- Specify names_from = and values_from =

Reshaping: wide vs. long data

What is wide/long data?

What is wide/long data?

What is wide/long data?

Why do we need to switch between wide/long data?

Pivoting using tidyr package

pivot_longer…

Reshaping data from wide to long

Reshaping data from wide to long

Reshaping data from wide to long

Reshaping data from wide to long

Data used: Charm City Circulator

Reshaping data from wide to long

Reshaping data from wide to long

Cleaning up long data

Cleaning up long data

pivot_wider…

Reshaping data from long to wide

Reshaping data from long to wide

Reshaping Charm City Circulator

Reshaping Charm City Circulator

Adding prefixes

Prefixes when pivoting

Prefixes when pivoting

Prefixes when pivoting

Summary

Pivoting using `tidyr` package

`pivot_longer`…

`pivot_wider`…