library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
- Read in the UFO dataset (used in the Data IO lectures) as an R object called ufo. You can read directly from the web here: https://raw.githubusercontent.com/SISBID/Module1/gh-pages/data/ufo/ufo_data_complete.csv. You can ignore the “problems” with some rows.
library(readr)
ufo <- read_csv("https://raw.githubusercontent.com/SISBID/Module1/gh-pages/data/ufo/ufo_data_complete.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 88875 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
## dbl (1): duration (seconds)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
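The warning above points to readr’s problems() helper. The exercise says the parsing problems can be ignored, but if you are curious, a quick optional peek at the flagged rows looks like this sketch:

# Optional: inspect the rows readr flagged while parsing (can be skipped)
problems(ufo)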
- Clean up the column/variable names of the ufo dataset to remove spaces and non-alphanumeric characters. You can use the dplyr::rename() function or look into the janitor::clean_names() function (a rename() sketch is shown after the solution below). Save the data as ufo_clean.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
ufo_clean <- clean_names(ufo)
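For reference, the dplyr::rename() route mentioned in the instructions might look like the sketch below. It only fixes the two duration columns shown in the column specification above (for illustration), whereas clean_names() fixes every name at once.

# Sketch of the rename() alternative, fixing only two columns for illustration
ufo_renamed <- ufo %>%
  rename(duration_seconds = `duration (seconds)`,
         duration_hours_min = `duration (hours/min)`)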
- Filter for rows where state is “tx”, “nm”, or “ut”. Then use recode to make an exact swap of “Texas” for “tx”, “New_Mexico” for “nm”, and “Utah” for “ut” in the state variable. Save the output as South_West. Hint: you will need mutate.
South_West <- ufo_clean %>%
  filter(state %in% c("tx", "nm", "ut")) %>%
  mutate(state = recode(state,
                        "tx" = "Texas",
                        "nm" = "New_Mexico",
                        "ut" = "Utah"))
South_West
## # A tibble: 5,763 × 11
## datetime city state country shape duration_seconds duration_hours_min
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1949 20:… san … Texas us cyli… 2700 45 minutes
## 2 10/10/1949 21:… lack… Texas <NA> light 7200 1-2 hrs
## 3 10/10/1956 21:… edna Texas us circ… 20 1/2 hour
## 4 10/10/1977 12:… san … Texas us other 30 30 seconds
## 5 10/10/1980 19:… hous… Texas us sphe… 180 3 min
## 6 10/10/1980 22:… dall… Texas us unkn… 300 5 minutes
## 7 10/10/1984 05:… hous… Texas us circ… 60 1 minute
## 8 10/10/1992 18:… staf… Texas us unkn… 10 10 seconds
## 9 10/10/1992 22:… weat… Texas us unkn… 30 30 seconds
## 10 10/10/1994 15:… merc… Texas <NA> cigar 3600 1 hour
## # ℹ 5,753 more rows
## # ℹ 4 more variables: comments <chr>, date_posted <chr>, latitude <chr>,
## # longitude <chr>
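A quick sanity check, not required by the exercise, is to count the recoded values and confirm that no lowercase abbreviations remain:

# Optional check: all rows should now show the full state names
South_West %>% count(state)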
- Use case_when() to create a new variable called “continent”. If the country is “ca” or “us”, make the value “North America”; if it is “gb” or “de”, make the value “Europe”; and if it is “au”, make it “Australia”. No need to worry about the TRUE statement, as we want the other countries to stay NA.
ufo_clean %>%
mutate(continent = case_when(country %in% c("ca", "us") ~ "North America",
country %in% c("gb", "de") ~ "Europe",
country == "au" ~ "Australia"))
## # A tibble: 88,875 × 12
## datetime city state country shape duration_seconds duration_hours_min
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/1949 20:… san … tx us cyli… 2700 45 minutes
## 2 10/10/1949 21:… lack… tx <NA> light 7200 1-2 hrs
## 3 10/10/1955 17:… ches… <NA> gb circ… 20 20 seconds
## 4 10/10/1956 21:… edna tx us circ… 20 1/2 hour
## 5 10/10/1960 20:… kane… hi us light 900 15 minutes
## 6 10/10/1961 19:… bris… tn us sphe… 300 5 minutes
## 7 10/10/1965 21:… pena… <NA> gb circ… 180 about 3 mins
## 8 10/10/1965 23:… norw… ct us disk 1200 20 minutes
## 9 10/10/1966 20:… pell… al us disk 180 3 minutes
## 10 10/10/1966 21:… live… fl us disk 120 several minutes
## # ℹ 88,865 more rows
## # ℹ 5 more variables: comments <chr>, date_posted <chr>, latitude <chr>,
## # longitude <chr>, continent <chr>
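To confirm the mapping, and that countries outside the listed codes stay NA, you could tabulate country against the new continent column; a sketch of that optional check:

# Optional check: each country code should map to one continent,
# and unmatched countries should have an NA continent
ufo_clean %>%
  mutate(continent = case_when(country %in% c("ca", "us") ~ "North America",
                               country %in% c("gb", "de") ~ "Europe",
                               country == "au" ~ "Australia")) %>%
  count(country, continent)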