library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
- Read in the UFO dataset (used in the Data IO lectures) as an R object called `ufo`. You can read it directly from the web here: https://raw.githubusercontent.com/SISBID/Module1/gh-pages/data/ufo/ufo_data_complete.csv. You can ignore the “problems” with some rows.
library(readr)
ufo <- read_csv("https://raw.githubusercontent.com/SISBID/Module1/gh-pages/data/ufo/ufo_data_complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl (1): duration (seconds)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
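As the message above notes, readr lets you pass `show_col_types = FALSE` to quiet the column report, or declare the types yourself with `col_types =` so nothing has to be guessed. The same idea in base R (a simplified sketch using `read.csv()`, not readr itself) looks like:

```r
# Declare column types up front instead of letting the reader guess them.
# Base-R analogue of readr's `col_types =` argument:
dat <- read.csv(text = "a,b\n1,x\n2,y",
                colClasses = c(a = "numeric", b = "character"))
str(dat)
```

Fixing types at read time also fails fast if a column contains unexpected values, rather than silently guessing.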
- Clean up the column/variable names of the `ufo` dataset to remove spaces and non-alphanumeric characters. You can use the `dplyr::rename()` function or look into the `janitor::clean_names()` function. Save the data as `ufo_clean`.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
ufo_clean <- clean_names(ufo)
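For intuition, `clean_names()` lower-cases the names and collapses runs of spaces and punctuation into underscores (so `duration (seconds)` becomes `duration_seconds`). A rough base-R sketch of that behavior (a simplification, not janitor's actual implementation):

```r
# Simplified imitation of janitor::clean_names() on a character vector:
clean_names_base <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # non-alphanumeric runs -> "_"
  gsub("^_+|_+$", "", x)           # trim leading/trailing "_"
}
clean_names_base(c("duration (seconds)", "date posted"))
# "duration_seconds" "date_posted"
```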
- Filter for rows where state is “tx”, “nm”, or “ut”. Then use `recode()` to make an exact swap of “Texas” for “tx”, “New_Mexico” for “nm”, and “Utah” for “ut” in the state variable. Save the output as `South_West`. Hint: you will need `mutate()`.
South_West <- ufo_clean %>%
  filter(state %in% c("tx", "nm", "ut")) %>%
  mutate(state = recode(state,
                        "tx" = "Texas",
                        "nm" = "New_Mexico",
                        "ut" = "Utah"))
South_West
## # A tibble: 5,763 × 11
##    datetime    city  state country shape durat…¹ durat…² comme…³ date_…⁴ latit…⁵
##    <chr>       <chr> <chr> <chr>   <chr>   <dbl> <chr>   <chr>   <chr>   <chr>
##  1 10/10/1949… san … Texas us      cyli…    2700 45 min… This e… 4/27/2… 29.883…
##  2 10/10/1949… lack… Texas <NA>    light    7200 1-2 hrs 1949 L… 12/16/… 29.384…
##  3 10/10/1956… edna  Texas us      circ…      20 1/2 ho… My old… 1/17/2… 28.978…
##  4 10/10/1977… san … Texas us      other      30 30 sec… i was … 2/24/2… 29.423…
##  5 10/10/1980… hous… Texas us      sphe…     180 3 min   Sphere… 4/16/2… 29.763…
##  6 10/10/1980… dall… Texas us      unkn…     300 5 minu… Strang… 10/28/… 32.783…
##  7 10/10/1984… hous… Texas us      circ…      60 1 minu… 2 expe… 4/18/2… 29.763…
##  8 10/10/1992… staf… Texas us      unkn…      10 10 sec… A man … 4/18/2… 29.615…
##  9 10/10/1992… weat… Texas us      unkn…      30 30 sec… Black … 9/2/20… 32.759…
## 10 10/10/1994… merc… Texas <NA>    cigar    3600 1 hour  ufo ch… 12/12/… 26.149…
## # … with 5,753 more rows, 1 more variable: longitude <chr>, and abbreviated
## #   variable names ¹duration_seconds, ²duration_hours_min, ³comments,
## #   ⁴date_posted, ⁵latitude
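The easy mistake here is the direction of `recode()`: it maps old values to new, so the existing value goes on the left (`"tx" = "Texas"`), and the result must be assigned back inside `mutate(state = ...)` or you get a new column with an unwieldy name. A named lookup vector in base R behaves the same way and makes the direction easy to check:

```r
# recode(x, old = new): the existing value is the name, the
# replacement is the value. Base-R equivalent with a named vector:
lookup <- c(tx = "Texas", nm = "New_Mexico", ut = "Utah")
states <- c("tx", "ut", "nm", "tx")
unname(lookup[states])
# "Texas" "Utah" "New_Mexico" "Texas"
```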
- Use `case_when()` to create a new variable called “continent”. If the country is “ca” or “us”, make the value “North America”; if it is “gb” or “de”, make the value “Europe”; and if it is “au”, make it “Australia”. No need to worry about the `TRUE` statement, as we want to keep our other `NA` values.
ufo_clean %>%
mutate(continent = case_when(country %in% c("ca", "us") ~ "North America",
country %in% c("gb", "de") ~ "Europe",
country == "au" ~ "Australia"))
## # A tibble: 88,875 × 12
## datetime city state country shape durat…¹ durat…² comme…³ date_…⁴ latit…⁵
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 10/10/1949… san … tx us cyli… 2700 45 min… This e… 4/27/2… 29.883…
## 2 10/10/1949… lack… tx <NA> light 7200 1-2 hrs 1949 L… 12/16/… 29.384…
## 3 10/10/1955… ches… <NA> gb circ… 20 20 sec… Green/… 1/21/2… 53.2
## 4 10/10/1956… edna tx us circ… 20 1/2 ho… My old… 1/17/2… 28.978…
## 5 10/10/1960… kane… hi us light 900 15 min… AS a M… 1/22/2… 21.418…
## 6 10/10/1961… bris… tn us sphe… 300 5 minu… My fat… 4/27/2… 36.595…
## 7 10/10/1965… pena… <NA> gb circ… 180 about … penart… 2/14/2… 51.434…
## 8 10/10/1965… norw… ct us disk 1200 20 min… A brig… 10/2/1… 41.117…
## 9 10/10/1966… pell… al us disk 180 3 min… Strobe… 3/19/2… 33.586…
## 10 10/10/1966… live… fl us disk 120 severa… Saucer… 5/11/2… 30.294…
## # … with 88,865 more rows, 2 more variables: longitude <chr>, continent <chr>,
## # and abbreviated variable names ¹duration_seconds, ²duration_hours_min,
## # ³comments, ⁴date_posted, ⁵latitude
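`case_when()` evaluates its conditions in order and returns `NA` for rows that match none of them, which is why omitting a final `TRUE ~ ...` clause leaves the remaining countries as `NA`. The same fall-through logic in base R with nested `ifelse()` (a sketch for intuition, not a replacement for `case_when()`):

```r
# Nested ifelse() mirroring the case_when() above; unmatched
# countries fall through to NA, like case_when() without TRUE ~ ...
continent_of <- function(country) {
  ifelse(country %in% c("ca", "us"), "North America",
  ifelse(country %in% c("gb", "de"), "Europe",
  ifelse(country %in% "au", "Australia", NA_character_)))
}
continent_of(c("us", "gb", "au", "in"))
# "North America" "Europe" "Australia" NA
```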