library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
- Read in the UFO dataset (used in the Data IO lectures) as an R object called `ufo`. You can read it directly from the web here: https://raw.githubusercontent.com/SISBID/Module1/gh-pages/data/ufo/ufo_data_complete.csv. You can ignore the “problems” with some rows.
library(readr)
ufo <- read_csv("https://raw.githubusercontent.com/SISBID/Module1/gh-pages/data/ufo/ufo_data_complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl (1): duration (seconds)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
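As the message above notes, readr lets you pass `show_col_types = FALSE` to quiet the column report, or declare the types yourself with `col_types =` so nothing has to be guessed. The same idea in base R (a simplified sketch using `read.csv()`, not readr itself) looks like:

```r
# Declare column types up front instead of letting the reader guess them.
# Base-R analogue of readr's `col_types =` argument:
dat <- read.csv(text = "a,b\n1,x\n2,y",
                colClasses = c(a = "numeric", b = "character"))
str(dat)
```

Fixing types at read time also fails fast if a column contains unexpected values, rather than silently guessing.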
- Clean up the column/variable names of the `ufo` dataset to remove spaces and non-alphanumeric characters. You can use the `dplyr::rename()` function or look into the `janitor::clean_names()` function. Save the data as `ufo_clean`.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
ufo_clean <- clean_names(ufo)
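For intuition, `clean_names()` lower-cases the names and collapses runs of spaces and punctuation into underscores (so `duration (seconds)` becomes `duration_seconds`). A rough base-R sketch of that behavior (a simplification, not janitor's actual implementation):

```r
# Simplified imitation of janitor::clean_names() on a character vector:
clean_names_base <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # non-alphanumeric runs -> "_"
  gsub("^_+|_+$", "", x)           # trim leading/trailing "_"
}
clean_names_base(c("duration (seconds)", "date posted"))
# "duration_seconds" "date_posted"
```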
- Filter for rows where state is “tx”, “nm”, or “ut”. Then use `recode()` to make an exact swap of “Texas” for “tx”, “New_Mexico” for “nm”, and “Utah” for “ut” in the state variable. Save the output as `South_West`. Hint: you will need `mutate()`.
South_West <- ufo_clean %>%
  filter(state %in% c("tx", "nm", "ut")) %>%
  mutate(state = recode(state,
                        "tx" = "Texas",
                        "nm" = "New_Mexico",
                        "ut" = "Utah"))
South_West
## # A tibble: 5,763 × 11
##    datetime    city  state country shape durat…¹ durat…² comme…³ date_…⁴ latit…⁵
##    <chr>       <chr> <chr> <chr>   <chr>   <dbl> <chr>   <chr>   <chr>   <chr>
##  1 10/10/1949… san … Texas us      cyli…    2700 45 min… This e… 4/27/2… 29.883…
##  2 10/10/1949… lack… Texas <NA>    light    7200 1-2 hrs 1949 L… 12/16/… 29.384…
##  3 10/10/1956… edna  Texas us      circ…      20 1/2 ho… My old… 1/17/2… 28.978…
##  4 10/10/1977… san … Texas us      other      30 30 sec… i was … 2/24/2… 29.423…
##  5 10/10/1980… hous… Texas us      sphe…     180 3 min   Sphere… 4/16/2… 29.763…
##  6 10/10/1980… dall… Texas us      unkn…     300 5 minu… Strang… 10/28/… 32.783…
##  7 10/10/1984… hous… Texas us      circ…      60 1 minu… 2 expe… 4/18/2… 29.763…
##  8 10/10/1992… staf… Texas us      unkn…      10 10 sec… A man … 4/18/2… 29.615…
##  9 10/10/1992… weat… Texas us      unkn…      30 30 sec… Black … 9/2/20… 32.759…
## 10 10/10/1994… merc… Texas <NA>    cigar    3600 1 hour  ufo ch… 12/12/… 26.149…
## # … with 5,753 more rows, 1 more variable: longitude <chr>, and abbreviated
## #   variable names ¹duration_seconds, ²duration_hours_min, ³comments,
## #   ⁴date_posted, ⁵latitude
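The easy mistake here is the direction of `recode()`: it maps old values to new, so the existing value goes on the left (`"tx" = "Texas"`), and the result must be assigned back inside `mutate(state = ...)` or you get a new column with an unwieldy name. A named lookup vector in base R behaves the same way and makes the direction easy to check:

```r
# recode(x, old = new): the existing value is the name, the
# replacement is the value. Base-R equivalent with a named vector:
lookup <- c(tx = "Texas", nm = "New_Mexico", ut = "Utah")
states <- c("tx", "ut", "nm", "tx")
unname(lookup[states])
# "Texas" "Utah" "New_Mexico" "Texas"
```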
- Use `case_when()` to create a new variable called “continent”. If the country is “ca” or “us”, make the value “North America”; if it is “gb” or “de”, make the value “Europe”; and if it is “au”, make it “Australia”. No need to worry about the `TRUE` statement, as we want to keep our other `NA` values.
ufo_clean %>%
mutate(continent = case_when(country %in% c("ca", "us") ~ "North America",
country %in% c("gb", "de") ~ "Europe",
country == "au" ~ "Australia"))
## # A tibble: 88,875 × 12
## datetime city state country shape durat…¹ durat…² comme…³ date_…⁴ latit…⁵
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 10/10/1949… san … tx us cyli… 2700 45 min… This e… 4/27/2… 29.883…
## 2 10/10/1949… lack… tx <NA> light 7200 1-2 hrs 1949 L… 12/16/… 29.384…
## 3 10/10/1955… ches… <NA> gb circ… 20 20 sec… Green/… 1/21/2… 53.2
## 4 10/10/1956… edna tx us circ… 20 1/2 ho… My old… 1/17/2… 28.978…
## 5 10/10/1960… kane… hi us light 900 15 min… AS a M… 1/22/2… 21.418…
## 6 10/10/1961… bris… tn us sphe… 300 5 minu… My fat… 4/27/2… 36.595…
## 7 10/10/1965… pena… <NA> gb circ… 180 about … penart… 2/14/2… 51.434…
## 8 10/10/1965… norw… ct us disk 1200 20 min… A brig… 10/2/1… 41.117…
## 9 10/10/1966… pell… al us disk 180 3 min… Strobe… 3/19/2… 33.586…
## 10 10/10/1966… live… fl us disk 120 severa… Saucer… 5/11/2… 30.294…
## # … with 88,865 more rows, 2 more variables: longitude <chr>, continent <chr>,
## # and abbreviated variable names ¹duration_seconds, ²duration_hours_min,
## # ³comments, ⁴date_posted, ⁵latitude
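`case_when()` evaluates its conditions in order and returns `NA` for rows that match none of them, which is why omitting a final `TRUE ~ ...` clause leaves the remaining countries as `NA`. The same fall-through logic in base R with nested `ifelse()` (a sketch for intuition, not a replacement for `case_when()`):

```r
# Nested ifelse() mirroring the case_when() above; unmatched
# countries fall through to NA, like case_when() without TRUE ~ ...
continent_of <- function(country) {
  ifelse(country %in% c("ca", "us"), "North America",
  ifelse(country %in% c("gb", "de"), "Europe",
  ifelse(country %in% "au", "Australia", NA_character_)))
}
continent_of(c("us", "gb", "au", "in"))
# "North America" "Europe" "Australia" NA
```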