Merging and Joining Lab Key

Part 1

Read in the data and use functions of your choice to preview it.

library(tidyverse)

crash <- read_csv("https://sisbid.github.io/Data-Wrangling/labs/crashes.csv")
road <- read_csv("https://sisbid.github.io/Data-Wrangling/labs/roads.csv")
head(crash)

## # A tibble: 6 × 4
##    Year Road          N_Crashes Volume
##   <dbl> <chr>             <dbl>  <dbl>
## 1  1991 Interstate 65        25  40000
## 2  1992 Interstate 65        37  41000
## 3  1993 Interstate 65        45  45000
## 4  1994 Interstate 65        46  45600
## 5  1995 Interstate 65        46  49000
## 6  1996 Interstate 65        59  51000

head(road)

## # A tibble: 5 × 3
##   Road          District       Length
##   <chr>         <chr>           <dbl>
## 1 Interstate 65 Greenfield        262
## 2 Interstate 70 Vincennes         156
## 3 US-36         Crawfordsville    139
## 4 US-40         Greenfield        150
## 5 US-52         Crawfordsville    172

Join data to retain only complete data, (using an inner join) e.g. those observations with road lengths and districts. Merge without using by argument, then merge using by = "Road". call the output merged. How many observations are there?

merged <- inner_join(crash, road)

## Joining with `by = join_by(Road)`

merged <- inner_join(crash, road, by = "Road")
nrow(merged)

## [1] 88

Join data using a full_join. Call the output full. How many observations are there?

full <- full_join(crash, road)

## Joining with `by = join_by(Road)`

nrow(full)

## [1] 111

Do a left join of the road and crash. ORDER matters here! How many observations are there?

left <- left_join(road, crash)

## Joining with `by = join_by(Road)`

nrow(left)

## [1] 89

Repeat above with a right_join with the same order of the arguments. How many observations are there?

right <- right_join(road, crash)

## Joining with `by = join_by(Road)`

nrow(right)

## [1] 110

What road data is missing from crash?

roads1 <- road %>% pull(Road)
roads2 <- crash %>% pull(Road)
setdiff(roads1, roads2) # This value is in `road` but not `crash`

## [1] "US-52"

# Could also search for NAs created by the join
full %>% filter(is.na(N_Crashes))

## # A tibble: 1 × 6
##    Year Road  N_Crashes Volume District       Length
##   <dbl> <chr>     <dbl>  <dbl> <chr>           <dbl>
## 1    NA US-52        NA     NA Crawfordsville    172

anti_join(road, crash)

## Joining with `by = join_by(Road)`

## # A tibble: 1 × 3
##   Road  District       Length
##   <chr> <chr>           <dbl>
## 1 US-52 Crawfordsville    172

What crash data is missing from `road``?

roads1 <- road %>% pull(Road)
roads2 <- crash %>% pull(Road)
setdiff(roads2, roads1) # These values are in `crash` but not `road`

## [1] "Interstate 275"

# Could also search for NAs created by the join. Would be good to summarize with `count`
full %>% filter(is.na(District)) %>% count(Road)

## # A tibble: 1 × 2
##   Road               n
##   <chr>          <int>
## 1 Interstate 275    22

anti_join(crash, road)

## Joining with `by = join_by(Road)`

## # A tibble: 22 × 4
##     Year Road           N_Crashes Volume
##    <dbl> <chr>              <dbl>  <dbl>
##  1  1991 Interstate 275        27  20350
##  2  1992 Interstate 275        26  21200
##  3  1993 Interstate 275        22  23200
##  4  1994 Interstate 275        21  21200
##  5  1995 Interstate 275        28  23200
##  6  1996 Interstate 275        22  20000
##  7  1997 Interstate 275        27  18000
##  8  1998 Interstate 275        21  19500
##  9  1999 Interstate 275        22  21000
## 10  2000 Interstate 275        29  20700
## # ℹ 12 more rows

Merging and Joining Lab Key

Data Wrangling in R

Part 1