Merging and Joining Lab Key

Part 1

Read in the data and use functions of your choice to preview it.

library(tidyverse)

crash <- read_csv("https://sisbid.github.io/Data-Wrangling/labs/crashes.csv")
road <- read_csv("https://sisbid.github.io/Data-Wrangling/labs/roads.csv")
head(crash)

# A tibble: 6 × 4
   Year Road          N_Crashes Volume
  <dbl> <chr>             <dbl>  <dbl>
1  1991 Interstate 65        25  40000
2  1992 Interstate 65        37  41000
3  1993 Interstate 65        45  45000
4  1994 Interstate 65        46  45600
5  1995 Interstate 65        46  49000
6  1996 Interstate 65        59  51000

head(road)

# A tibble: 5 × 3
  Road          District       Length
  <chr>         <chr>           <dbl>
1 Interstate 65 Greenfield        262
2 Interstate 70 Vincennes         156
3 US-36         Crawfordsville    139
4 US-40         Greenfield        150
5 US-52         Crawfordsville    172

Join the data with an inner join to retain only complete records, i.e. crash observations that also have a road length and district. First merge without the by argument, then merge using by = "Road". Call the output merged. How many observations are there?

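# Without `by`, dplyr joins on every column name the tables share (here, only Road)
merged <- inner_join(crash, road)
# Same join with the key stated explicitly
merged <- inner_join(crash, road, by = "Road")
nrow(merged)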

[1] 88
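To make the difference between the two calls concrete, here is a minimal sketch on toy data. The tables `crash_toy` and `road_toy` and their values are hypothetical stand-ins, not the lab data. It illustrates that omitting `by` falls back to the shared column names, so both calls return the same rows.

```r
library(dplyr)

# Hypothetical toy tables (values invented for illustration only)
crash_toy <- tibble(
  Road = c("A", "A", "B", "C"),
  Year = c(1991, 1992, 1991, 1991),
  N_Crashes = c(5, 7, 3, 9)
)
road_toy <- tibble(
  Road = c("A", "B", "D"),
  Length = c(100, 50, 75)
)

# No `by`: dplyr picks the shared column(s) and prints a message naming them
implicit <- inner_join(crash_toy, road_toy)

# Explicit key: same join, no message
explicit <- inner_join(crash_toy, road_toy, by = "Road")

identical(implicit, explicit)
nrow(explicit)  # roads A (two years) and B match, so 3 rows
```

Stating `by` explicitly is usually the safer habit, since a column that happens to share a name in both tables would otherwise silently become part of the key.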

Join data using a full_join. Call the output full. How many observations are there?

full <- full_join(crash, road)
nrow(full)

[1] 111

Do a left join of road and crash, in that order. ORDER matters here! How many observations are there?

left <- left_join(road, crash)
nrow(left)

[1] 89

Repeat the above with right_join, keeping the arguments in the same order. How many observations are there?

right <- right_join(road, crash)
nrow(right)

[1] 110
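The two counts differ because each join keeps all rows from a different table. The sketch below uses hypothetical toy tables (not the lab data) to show this, and to check that a right join is a left join with the arguments swapped, up to row and column order.

```r
library(dplyr)

# Hypothetical toy tables (values invented for illustration only)
crash_toy <- tibble(
  Road = c("A", "A", "B", "C"),
  Year = c(1991, 1992, 1991, 1991),
  N_Crashes = c(5, 7, 3, 9)
)
road_toy <- tibble(
  Road = c("A", "B", "D", "E"),
  Length = c(100, 50, 75, 20)
)

left_rc  <- left_join(road_toy, crash_toy, by = "Road")   # keeps every road
right_rc <- right_join(road_toy, crash_toy, by = "Road")  # keeps every crash row
left_cr  <- left_join(crash_toy, road_toy, by = "Road")   # arguments swapped

nrow(left_rc)   # 5: A twice, B once, plus unmatched D and E
nrow(right_rc)  # 4: A twice, B once, plus unmatched C

# right_join(x, y) and left_join(y, x) hold the same rows once columns
# and rows are put in a common order
same_rows <- all.equal(
  as.data.frame(arrange(left_cr, Road, Year)),
  as.data.frame(arrange(select(right_rc, all_of(names(left_cr))), Road, Year))
)
same_rows
```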

What road data is missing from `crash`?

roads1 <- road %>% pull(Road)
roads2 <- crash %>% pull(Road)
setdiff(roads1, roads2) # This value is in `road` but not `crash`

[1] "US-52"

# Could also search for NAs created by the join
full %>% filter(is.na(N_Crashes))

# A tibble: 1 × 6
   Year Road  N_Crashes Volume District       Length
  <dbl> <chr>     <dbl>  <dbl> <chr>           <dbl>
1    NA US-52        NA     NA Crawfordsville    172

anti_join(road, crash)

# A tibble: 1 × 3
  Road  District       Length
  <chr> <chr>           <dbl>
1 US-52 Crawfordsville    172

What crash data is missing from `road`?

roads1 <- road %>% pull(Road)
roads2 <- crash %>% pull(Road)
setdiff(roads2, roads1) # These values are in `crash` but not `road`

[1] "Interstate 275"

# Could also search for NAs created by the join. Would be good to summarize with `count`
full %>% filter(is.na(District)) %>% count(Road)

# A tibble: 1 × 2
  Road               n
  <chr>          <int>
1 Interstate 275    22

anti_join(crash, road)

# A tibble: 22 × 4
    Year Road           N_Crashes Volume
   <dbl> <chr>              <dbl>  <dbl>
 1  1991 Interstate 275        27  20350
 2  1992 Interstate 275        26  21200
 3  1993 Interstate 275        22  23200
 4  1994 Interstate 275        21  21200
 5  1995 Interstate 275        28  23200
 6  1996 Interstate 275        22  20000
 7  1997 Interstate 275        27  18000
 8  1998 Interstate 275        21  19500
 9  1999 Interstate 275        22  21000
10  2000 Interstate 275        29  20700
# ℹ 12 more rows
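The row counts in this lab fit together: the 88 matched rows from the inner join, plus the 22 rows from anti_join(crash, road) and the 1 row from anti_join(road, crash), give the 111 rows of the full join. A minimal sketch of that check follows, using hypothetical toy tables rather than the lab data.

```r
library(dplyr)

# Hypothetical toy tables (values invented for illustration only)
crash_toy <- tibble(
  Road = c("A", "A", "B", "C"),
  Year = c(1991, 1992, 1991, 1991),
  N_Crashes = c(5, 7, 3, 9)
)
road_toy <- tibble(
  Road = c("A", "B", "D", "E"),
  Length = c(100, 50, 75, 20)
)

n_inner      <- nrow(inner_join(crash_toy, road_toy, by = "Road"))  # matched rows
n_only_crash <- nrow(anti_join(crash_toy, road_toy, by = "Road"))   # crash rows with no road
n_only_road  <- nrow(anti_join(road_toy, crash_toy, by = "Road"))   # roads with no crash rows
n_full       <- nrow(full_join(crash_toy, road_toy, by = "Road"))

# A full join is the matched rows plus the unmatched rows from each side
n_full == n_inner + n_only_crash + n_only_road
```

Running the same four calls on `crash` and `road` is a quick way to confirm that no rows were lost or duplicated unexpectedly.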

Data Wrangling in R
