Data Summarization Lab Key

Data used

Charm City Circulator ridership dataset: the data come from https://data.baltimorecity.gov/Transportation/Charm-City-Circulator-Ridership/wwvu-583r

Available on: https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

circ <- read_csv("https://sisbid.github.io/Data-Wrangling/data/Charm_City_Circulator_Ridership.csv")

## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Each row is a different day. How many days are in the data set?

nrow(circ)

## [1] 1146

dim(circ)

## [1] 1146   15

circ %>% 
  nrow()

## [1] 1146

What is the total (sum) number of boardings on the green bus (greenBoardings column)?

sum(circ$greenBoardings, na.rm = TRUE)

## [1] 935564

circ %>% pull(greenBoardings) %>% sum(na.rm = TRUE)

## [1] 935564

count(circ, wt = greenBoardings)

## # A tibble: 1 × 1
##        n
##    <dbl>
## 1 935564
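The na.rm = TRUE argument in the first two calls matters because sum() returns NA as soon as any element is missing. A minimal sketch on a made-up vector (the values and the name boardings are hypothetical, not taken from the circulator data):

```r
# Toy vector with one missing value (hypothetical, not the real data)
boardings <- c(120, NA, 310)

# Without na.rm, a single NA makes the whole sum NA
total_with_na <- sum(boardings)

# With na.rm = TRUE, missing values are dropped before summing
total_without_na <- sum(boardings, na.rm = TRUE)

total_with_na
total_without_na
```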

How many days are missing daily ridership (daily column)? Use is.na() and sum().

daily <- circ %>% pull(daily)
sum(is.na(daily))

## [1] 124

# Can also use count() on the logical condition
circ %>% 
  count(is.na(daily))

## # A tibble: 2 × 2
##   `is.na(daily)`     n
##   <lgl>          <int>
## 1 FALSE           1022
## 2 TRUE             124
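sum(is.na(x)) counts missing values because is.na() returns a logical vector, and sum() treats TRUE as 1 and FALSE as 0. The same idea gives the proportion missing via mean(). A small sketch, assuming a made-up vector named ridership (hypothetical values):

```r
# Toy vector (hypothetical values) with two missing entries
ridership <- c(500, NA, 750, NA, 620)

# Logical vector: TRUE where the value is missing
missing_flags <- is.na(ridership)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of NAs
n_missing <- sum(missing_flags)

# The mean of a logical vector is the share of TRUE values
prop_missing <- mean(missing_flags)

n_missing
prop_missing
```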

Group the data by day of the week (day) and find the mean daily ridership (daily column). Hint: use the group_by() and summarize() functions.

circ %>% 
  group_by(day) %>% 
  summarize(mean = mean(daily, na.rm = TRUE))

## # A tibble: 7 × 2
##   day        mean
##   <chr>     <dbl>
## 1 Friday    8961.
## 2 Monday    7340.
## 3 Saturday  6743.
## 4 Sunday    4531.
## 5 Thursday  7639.
## 6 Tuesday   7642.
## 7 Wednesday 7779.
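The rows above come out in alphabetical order because day is a character column. One possible way to get calendar order, sketched here on a small made-up tibble (toy, week_order, and the values are all hypothetical), is to convert day to a factor with explicit levels before grouping:

```r
library(dplyr)

# Toy data (hypothetical values), not the circulator data
toy <- tibble(
  day   = c("Monday", "Friday", "Monday", "Sunday", "Friday"),
  daily = c(100, 300, 200, 50, NA)
)

# Calendar order for the factor levels
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Grouping by a factor sorts the summary rows by level order
by_day <- toy %>%
  mutate(day = factor(day, levels = week_order)) %>%
  group_by(day) %>%
  summarize(mean = mean(daily, na.rm = TRUE))

by_day
```

Days that do not appear in the data are dropped from the summary by default, so only Monday, Friday, and Sunday show up here.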

Practice on your own

What is the median of orangeBoardings? (Use median().)

circ %>% 
  summarize(median = median(orangeBoardings, na.rm = TRUE))

## # A tibble: 1 × 1
##   median
##    <dbl>
## 1   3074

# OR 
circ %>% pull(orangeBoardings) %>% median(na.rm = TRUE)

## [1] 3074

Take the median of orangeBoardings (use median()), but this time group by day of the week.

circ %>% 
  group_by(day) %>% 
  summarize(median = median(orangeBoardings, na.rm = TRUE))

## # A tibble: 7 × 2
##   day       median
##   <chr>      <dbl>
## 1 Friday     4014.
## 2 Monday     3336 
## 3 Saturday   2963 
## 4 Sunday     1900 
## 5 Thursday   3485 
## 6 Tuesday    3484 
## 7 Wednesday  3576
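The trailing decimal point in the Friday row (4014.) suggests the underlying value is not a whole number. That is what you would expect when a median is taken over an even number of values, since median() then averages the two middle values. A small sketch with made-up numbers (odd_n and even_n are hypothetical names):

```r
# Odd number of values: the median is the middle value
odd_n <- c(10, 20, 30)
median_odd <- median(odd_n)

# Even number of values: the median is the average of the two middle values
even_n <- c(10, 20, 31, 40)
median_even <- median(even_n)

median_odd
median_even
```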