Part 1

Read in the data and use functions of your choice to preview it.

library(tidyverse)

crash <- read_csv("https://sisbid.github.io/Data-Wrangling/labs/crashes.csv")
road <- read_csv("https://sisbid.github.io/Data-Wrangling/labs/roads.csv")
head(crash)
# A tibble: 6 × 4
   Year Road          N_Crashes Volume
  <dbl> <chr>             <dbl>  <dbl>
1  1991 Interstate 65        25  40000
2  1992 Interstate 65        37  41000
3  1993 Interstate 65        45  45000
4  1994 Interstate 65        46  45600
5  1995 Interstate 65        46  49000
6  1996 Interstate 65        59  51000
head(road)
# A tibble: 5 × 3
  Road          District       Length
  <chr>         <chr>           <dbl>
1 Interstate 65 Greenfield        262
2 Interstate 70 Vincennes         156
3 US-36         Crawfordsville    139
4 US-40         Greenfield        150
5 US-52         Crawfordsville    172
  1. Join data to retain only complete data, (using an inner join) e.g. those observations with road lengths and districts. Merge without using by argument, then merge using by = "Road". call the output merged. How many observations are there?
merged <- inner_join(crash, road)
merged <- inner_join(crash, road, by = "Road")
nrow(merged)
[1] 88
  1. Join data using a full_join. Call the output full. How many observations are there?
full <- full_join(crash, road)
nrow(full)
[1] 111
  1. Do a left join of the road and crash. ORDER matters here! How many observations are there?
left <- left_join(road, crash)
nrow(left)
[1] 89
  1. Repeat above with a right_join with the same order of the arguments. How many observations are there?
right <- right_join(road, crash)
nrow(right)
[1] 110
  1. What road data is missing from crash?
roads1 <- road %>% pull(Road)
roads2 <- crash %>% pull(Road)
setdiff(roads1, roads2) # This value is in `road` but not `crash`
[1] "US-52"
# Could also search for NAs created by the join
full %>% filter(is.na(N_Crashes))
# A tibble: 1 × 6
   Year Road  N_Crashes Volume District       Length
  <dbl> <chr>     <dbl>  <dbl> <chr>           <dbl>
1    NA US-52        NA     NA Crawfordsville    172
anti_join(road, crash)
# A tibble: 1 × 3
  Road  District       Length
  <chr> <chr>           <dbl>
1 US-52 Crawfordsville    172
  1. What crash data is missing from `road``?
roads1 <- road %>% pull(Road)
roads2 <- crash %>% pull(Road)
setdiff(roads2, roads1) # These values are in `crash` but not `road`
[1] "Interstate 275"
# Could also search for NAs created by the join. Would be good to summarize with `count`
full %>% filter(is.na(District)) %>% count(Road)
# A tibble: 1 × 2
  Road               n
  <chr>          <int>
1 Interstate 275    22
anti_join(crash, road)
# A tibble: 22 × 4
    Year Road           N_Crashes Volume
   <dbl> <chr>              <dbl>  <dbl>
 1  1991 Interstate 275        27  20350
 2  1992 Interstate 275        26  21200
 3  1993 Interstate 275        22  23200
 4  1994 Interstate 275        21  21200
 5  1995 Interstate 275        28  23200
 6  1996 Interstate 275        22  20000
 7  1997 Interstate 275        27  18000
 8  1998 Interstate 275        21  19500
 9  1999 Interstate 275        22  21000
10  2000 Interstate 275        29  20700
# ℹ 12 more rows