Data Wrangling in R
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
Read in data or download from: http://sisbid.github.io/Data-Wrangling/data/ufo/ufo_data_complete.csv.gz
ufo <- read_delim(
"https://sisbid.github.io/Data-Wrangling/data/ufo/ufo_data_complete.csv",
delim = ",")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 88875 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): datetime, city, state, country, shape, duration (hours/min), comme...
dbl (1): duration (seconds)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You saw warning messages when reading in this dataset. We can see
these with the problems()
function from
readr
.
If we scroll through we can see some interesting notes.
Rows: 200
Columns: 5
$ row <int> 878, 1713, 1815, 2858, 3734, 4756, 5389, 5423, 5614, 5849, 60…
$ col <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
$ expected <chr> "11 columns", "11 columns", "11 columns", "11 columns", "11 c…
$ actual <chr> "12 columns", "12 columns", "12 columns", "12 columns", "12 c…
$ file <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
# A tibble: 5 × 4
expected actual col n
<chr> <chr> <int> <int>
1 11 columns "12 columns" 12 196
2 a double "0.5`" 6 1
3 a double "2631600 " 6 1
4 a double "2`" 6 1
5 a double "8`" 6 1
[1] "datetime" "city" "state"
[4] "country" "shape" "duration (seconds)"
[7] "duration (hours/min)" "comments" "date posted"
[10] "latitude" "longitude"
Rows: 88,875
Columns: 11
$ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/…
$ city <chr> "san marcos", "lackland afb", "chester (uk/engl…
$ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al…
$ country <chr> "us", NA, "gb", "us", "us", "us", "gb", "us", "…
$ shape <chr> "cylinder", "light", "circle", "circle", "light…
$ `duration (seconds)` <dbl> 2700, 7200, 20, 20, 900, 300, 180, 1200, 180, 1…
$ `duration (hours/min)` <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hou…
$ comments <chr> "This event took place in early fall around 194…
$ `date posted` <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2…
$ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333",…
$ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.…
The slice function can show us particular row numbers
# A tibble: 4 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 30894 6 a double "2`" ""
2 39616 6 a double "8`" ""
3 45691 6 a double "2631600 " ""
4 65125 6 a double "0.5`" ""
The slice function can show us particular row numbers
Rows: 1
Columns: 11
$ datetime <chr> "2/2/2000 19:33"
$ city <chr> "bouse"
$ state <chr> "az"
$ country <chr> "us"
$ shape <chr> NA
$ `duration (seconds)` <dbl> NA
$ `duration (hours/min)` <chr> "each a few seconds"
$ comments <chr> "Driving through Plomosa Pass towards Bouse Loo…
$ `date posted` <chr> "2/16/2000"
$ latitude <chr> "33.9325000"
$ longitude <chr> "-114.0050000"
Now we have a chance to keep but clean these values! We will read in
duration (seconds)
now as a character type. Use
?read_csv
to see documentation about special ways of
reading in data.
url <-
"https://sisbid.github.io/Data-Wrangling/data/ufo/ufo_data_complete.csv"
ufo <-read_csv(url, col_types = cols(`duration (seconds)` = "c"))
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Looks like an extra column for these rows.
# A tibble: 1 × 4
expected actual col n
<chr> <chr> <int> <int>
1 11 columns 12 columns 12 196
Our previous problematic rows have values instead of NAs
Rows: 1
Columns: 11
$ datetime <chr> "2/2/2000 19:33"
$ city <chr> "bouse"
$ state <chr> "az"
$ country <chr> "us"
$ shape <chr> NA
$ `duration (seconds)` <chr> "2`"
$ `duration (hours/min)` <chr> "each a few seconds"
$ comments <chr> "Driving through Plomosa Pass towards Bouse Loo…
$ `date posted` <chr> "2/16/2000"
$ latitude <chr> "33.9325000"
$ longitude <chr> "-114.0050000"
Multiply by negative one to drop the rows. Use the slice
function to “select” those rows based on the index. Need to offset for
-1
because problems()
gives us the index row
based on the raw data, not the read in data (which has a header). We
will multiple by negative one to select out those rows as well.
# A tibble: 2 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 878 12 11 columns 12 columns ""
2 1713 12 11 columns 12 columns ""
[1] 877 1712 1814 2857 3733 4755
[1] -877 -1712 -1814 -2857 -3733 -4755
[1] 196
# A tibble: 1 × 3
expected actual n
<chr> <chr> <int>
1 11 columns 12 columns 196
clean_names()
function from the
janitor
package [1] "datetime" "city" "state"
[4] "country" "shape" "duration (seconds)"
[7] "duration (hours/min)" "comments" "date posted"
[10] "latitude" "longitude"
[1] "datetime" "city" "state"
[4] "country" "shape" "duration_seconds"
[7] "duration_hours_min" "comments" "date_posted"
[10] "latitude" "longitude"
within mutate...
recode(variable, value_old = value_new,
other_value_old = other_value_new)
Rows: 88,679
Columns: 11
$ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/1955…
$ city <chr> "san marcos", "lackland afb", "chester (uk/england)…
$ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al", "…
$ country <chr> "us", NA, "Great Britain", "us", "us", "us", "Great…
$ shape <chr> "cylinder", "light", "circle", "circle", "light", "…
$ duration_seconds <chr> "2700", "7200", "20", "20", "900", "300", "180", "1…
$ duration_hours_min <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", …
$ comments <chr> "This event took place in early fall around 1949-50…
$ date_posted <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2004"…
$ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333", "21…
$ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.6458…
ufo_clean %>% mutate(country =
recode(country,
gb = "Great Britain",
us = "United States")) %>%
glimpse()
Rows: 88,679
Columns: 11
$ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/1955…
$ city <chr> "san marcos", "lackland afb", "chester (uk/england)…
$ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al", "…
$ country <chr> "United States", NA, "Great Britain", "United State…
$ shape <chr> "cylinder", "light", "circle", "circle", "light", "…
$ duration_seconds <chr> "2700", "7200", "20", "20", "900", "300", "180", "1…
$ duration_hours_min <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", …
$ comments <chr> "This event took place in early fall around 1949-50…
$ date_posted <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2004"…
$ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333", "21…
$ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.6458…
# A tibble: 6 × 2
country n
<chr> <int>
1 au 593
2 ca 3266
3 de 112
4 gb 2050
5 us 70293
6 <NA> 12365
case_when()
regions to create a new variable based on
conditions of other variablescase_when(test ~ value if test is true,
test2 ~ vlue if test2 is true,
TRUE ~ value if all above tests are not true) # defaults to NA
ufo_clean <- ufo_clean %>% mutate(
region = case_when(
country %in% c("us", "ca") ~ "North America",
country %in% c("de") ~ "Europe",
country %in% "gb" ~ "Great Britain",
TRUE ~ "Other"
))
ufo_clean %>% select(country, region) %>% head()
# A tibble: 6 × 2
country region
<chr> <chr>
1 us North America
2 <NA> Other
3 gb Great Britain
4 us North America
5 us North America
6 us North America
case_when
- another exampleThe TRUE
value can also just be the original values.
ufo_clean %>% mutate(country = case_when(
country == "gb" ~ "Great Britain",
country == "us" ~"United States",
country == "au" ~ "Australia",
country == "DE" ~ "Germany",
TRUE ~ country))%>%
glimpse()
Rows: 88,679
Columns: 12
$ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/1955…
$ city <chr> "san marcos", "lackland afb", "chester (uk/england)…
$ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al", "…
$ country <chr> "United States", NA, "Great Britain", "United State…
$ shape <chr> "cylinder", "light", "circle", "circle", "light", "…
$ duration_seconds <chr> "2700", "7200", "20", "20", "900", "300", "180", "1…
$ duration_hours_min <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", …
$ comments <chr> "This event took place in early fall around 1949-50…
$ date_posted <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2004"…
$ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333", "21…
$ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.6458…
$ region <chr> "North America", "Other", "Great Britain", "North A…
recode
makes exact swapscase_when
can use conditionals, need to specify what
value for if no conditions are met (can be the original value of a
variable if we use the variable name).Sometimes country
is NA
even though
state
is known. A conditional more flexible recoding would
be helpful…
# A tibble: 6 × 12
datetime city state country shape duration_seconds duration_hours_min
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 10/10/1949 20:30 san … tx us cyli… 2700 45 minutes
2 10/10/1949 21:00 lack… tx <NA> light 7200 1-2 hrs
3 10/10/1955 17:00 ches… <NA> gb circ… 20 20 seconds
4 10/10/1956 21:00 edna tx us circ… 20 1/2 hour
5 10/10/1960 20:00 kane… hi us light 900 15 minutes
6 10/10/1961 19:00 bris… tn us sphe… 300 5 minutes
# ℹ 5 more variables: comments <chr>, date_posted <chr>, latitude <chr>,
# longitude <chr>, region <chr>
Looking at city… it seems like many of these are in fact in the US.
# A tibble: 2 × 3
country state n
<chr> <chr> <int>
1 us tx 3742
2 <NA> tx 299
# A tibble: 299 × 1
city
<chr>
1 lackland afb
2 mercedies
3 texas city/galveston
4 houston/tomball
5 bettendorf
6 dallas/ft. worth (mansfield)
7 halletsville
8 gulf of mexico
9 haltom
10 aubrey/frisco
# ℹ 289 more rows
# A tibble: 2 × 3
country state n
<chr> <chr> <int>
1 us ut 659
2 <NA> ut 138
# A tibble: 138 × 1
city
<chr>
1 canyonlands np
2 ogden/clinton
3 sandy
4 salt lake valley
5 sandy
6 duchenne
7 west valley
8 salt flats
9 west valley
10 west valley
# ℹ 128 more rows
[1] "ak" "al" "ar" "az" "ca" "co" "ct" "dc" "de" "fl" "ga" "hi" "ia" "id" "il"
[16] "in" "ks" "ky" "la" "ma" "md" "me" "mi" "mn" "mo" "ms" "mt" "nc" "nd" "ne"
[31] "nh" "nj" "nm" "nv" "ny" "oh" "ok" "or" "pa" "pr" "ri" "sc" "sd" "tn" "tx"
[46] "ut" "va" "vt" "wa" "wi" "wv" "wy"
[1] "ab" "bc" "mb" "nb" "nf" "ns" "nt" "on" "pe" "pq" "qc" "sa" "sk" "yk" "yt"
[16] NA
[1] "bc" "la" "ms" "nc" "ns" "nt" "ri" "sk" "tn" "wv" "yt" NA
A small overlap with US states.
[1] "al" "dc" "nt" "oh" "sa" "wa" "yt" NA
Some overlap with US states.
The setdiff()
function can show us what is unique or
different for the first of 2 listed sets.
[1] 1 2
[1] "a" "b"
[1] "ak" "al" "ar" "az" "ca" "co" "ct" "dc" "de" "fl" "ga" "hi" "ia" "id" "il"
[16] "in" "ks" "ky" "la" "ma" "md" "me" "mi" "mn" "mo" "ms" "mt" "nc" "nd" "ne"
[31] "nh" "nj" "nm" "nv" "ny" "oh" "ok" "or" "pa" "pr" "ri" "sc" "sd" "tn" "tx"
[46] "ut" "va" "vt" "wa" "wi" "wv" "wy"
[1] "al" "dc" "nt" "oh" "sa" "wa" "yt" NA "bc" "la" "ms" "nc" "ns" "nt" "ri"
[16] "sk" "tn" "wv" "yt" NA "ab" "bc" "mb" "nb" "nf" "ns" "nt" "on" "pe" "pq"
[31] "qc" "sa" "sk" "yk" "yt" NA
[1] "ak" "ar" "az" "ca" "co" "ct" "de" "fl" "ga" "hi" "ia" "id" "il" "in" "ks"
[16] "ky" "ma" "md" "me" "mi" "mn" "mo" "mt" "nd" "ne" "nh" "nj" "nm" "nv" "ny"
[31] "ok" "or" "pa" "pr" "sc" "sd" "tx" "ut" "va" "vt" "wi" "wy"
# A tibble: 4 × 2
country n
<chr> <int>
1 au 593
2 ca 3266
3 de 112
4 gb 2050
Let’s make an assumption that if the state value is within the data as a state for a specific country, than it comes from that country for the sake of illustration.
ufo_clean <- ufo_clean %>% mutate(prob_country =
case_when((is.na(country) & state %in% c(US_states)) ~ "United States",
(is.na(country) & state %in% c(CA_states)) ~ "Canada",
(is.na(country) & state %in% c(AU_states)) ~ "Australia",
(is.na(country) & state %in% c(GB_states)) ~ "Great Britain",
TRUE ~ country))
# A tibble: 10 × 2
prob_country n
<chr> <int>
1 Australia 694
2 Canada 536
3 Great Britain 5296
4 United States 5838
5 au 593
6 ca 3266
7 de 112
8 gb 2050
9 us 70293
10 <NA> 1
Take a look at those NAs.
# A tibble: 1 × 13
datetime city state country shape duration_seconds duration_hours_min
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 10/25/1997 22:00 st. … vi <NA> light 8 5-8 secds
# ℹ 6 more variables: comments <chr>, date_posted <chr>, latitude <chr>,
# longitude <chr>, region <chr>, prob_country <chr>
[1] "berlin (germany)"
[2] "berlin (germany)"
[3] "obernheim (germany)"
[4] "ottersberg (germany)"
[5] "urbach (germany)"
[6] "bremen (30 km south of) (germany)"
[7] "sembach (germany)"
[8] "magdeburg (germany)"
[9] "neuruppin (germany)"
[10] "lampertheim (germany)"
[11] "ramstein (germany)"
[12] "bremen (germany)"
[13] "nurenburg (germany)"
[14] "senftenberg (germany)"
[15] "schwalmtal (germany)"
[16] "neuss (germany)"
[17] "babenhausen (germany)"
[18] "berlin (germany)"
[19] "mittenwald (germany)"
[20] "ransbach-baumbach (germany)"
[21] "ansbach (germany)"
[22] "miesau (germany)"
[23] "bensheim (germany)"
[24] "muenster (germany)"
[25] "chemnitz (germany)"
[26] "kirchzell (germany)"
[27] "bremen (germany)"
[28] "wildflecken (germany)"
[29] "munich (germany)"
[30] "baumholder (germany)"
[31] "zirndorf (west germany)"
[32] "hamburg (germany)"
[33] "langenleiten (germany)"
[34] "baumholder (germany)"
[35] "zehdenick (germany)"
[36] "hanau (germany)"
[37] "berlin (germany)"
[38] "aachen (near cologne) (germany)"
[39] "munich (oberschliessheim army airfield) (germany)"
[40] "munich (near) (germany)"
[41] "bremen (germany)"
[42] "berlin (germany)"
[43] "bad pyrmont (germany)"
[44] "freiburg (germany)"
[45] "frankfurt am main (germany)"
[46] "siegen (germany)"
[47] "erlangen (germany)"
[48] "koblenz (westerwald mountains near) (germany)"
[49] "osnabruck (germany)"
[50] "kelsterbach (germany)"
[51] "trier (germany)"
[52] "thulba (germany)"
[53] "elbingen (germany)"
[54] "bocholt (germany)"
[55] "emmelshausen (germany)"
[56] "darmstadt (germany)"
[57] "stuttgart (germany)"
[58] "berlin (germany)"
[59] "ansbach (germany)"
[60] "frankfurt (germany)"
[61] "dresden (germany)"
[62] "mainz (germany)"
[63] "werder (havel) (germany)"
[64] "schweinfurt (west germany)"
[65] "emlichheim (germany)"
[66] "staufen (germany)"
[67] "neuseddin (potsdam)(germany)"
[68] "mannheim (west germany)"
[69] "schafhausen (germany)"
[70] "berlin (germany)"
[71] "erfurt (thuringia, germany)"
[72] "munich (germany)"
[73] "waldorf (west germany)"
[74] "bamberg (germany/bavaria)"
[75] "fulda (near) (germany)"
[76] "hamburg (germany)"
[77] "ansbach (germany)"
[78] "dresden (germany)"
[79] "bierenbachtal (germany)"
[80] "kassel (germany) (on highway)"
[81] "bamberg (germany)"
[82] "maugenhard (germany)"
[83] "aschersleben (germany)"
[84] "regensburg (germany)"
[85] "berlin (germany)"
[86] "berlin (germany)"
[87] "ramstein (germany)"
[88] "bochum (germany)"
[89] "mainz (germany)"
[90] "berlin (germany)"
[91] "neumarkt (germany)"
[92] "munich (germany)"
[93] "biesenthal (germany)"
[94] "haus (germany)"
[95] "freiburg (germany)"
[96] "obernheim (germany)"
[97] "weissenburg (germany)"
[98] "bitburg (germany)"
[99] "berlin (germany)"
[100] "heidelberg (germany)"
[101] "hannover (germany)"
[102] "schwetzingen (germany)"
[103] "buchholz (germany)"
[104] "cologne (germany)"
[105] "weiden (ne bavaria) (germany)"
[106] "grafenhausen (germany)"
[107] "heilbronn (germany)"
[108] "gelsenkirchen (germany)"
[109] "neckarsulm (germany)"
[110] "kelsterbach (germany)"
[111] "mannheim (germany)"
[112] "kaiserlautern (germany)"
ufo_clean <- ufo_clean %>% mutate(prob_country =
case_when(
(is.na(country) & state %in% c(US_states)) |
country == "us" ~ "United States",
(is.na(country) & state %in% c(CA_states)) |
country == "ca" ~ "Canada",
(is.na(country) & state %in% c(AU_states)) |
country == "au" ~ "Australia",
(is.na(country) & state %in% c(GB_states)) |
country == "gb" ~ "Great Britain",
country == "de" ~ "Germany",
TRUE ~ country))
We would want to confirm what we recoded with the cities and latitude and longitude, especially to deal with the overlaps in the state lists.
# A tibble: 10 × 3
country prob_country n
<chr> <chr> <int>
1 au Australia 593
2 ca Canada 3266
3 de Germany 112
4 gb Great Britain 2050
5 us United States 70293
6 <NA> Australia 694
7 <NA> Canada 536
8 <NA> Great Britain 5296
9 <NA> United States 5838
10 <NA> <NA> 1