For example, let’s say gender was coded as Male, M, m, Female, F, f. Using Excel to find all of these would be a matter of filtering and changing all by hand or using if statements.
Sometimes though, it’s not so simple. That’s where functions that find patterns come to be very useful.
table(gender)
gender F FeMAle FEMALE Fm M Ma mAle Male MaLe MALE Man 80 88 76 87 99 76 84 83 79 93 84 Woman 71
In R, you could use case_when()
:
#case_when way: data_gen <-data_gen %>% mutate(gender = case_when(gender %in% c("Male", "M", "m", "Man") ~ "Male", TRUE ~ gender)) head(data_gen)
# A tibble: 6 × 1 gender <chr> 1 F 2 Fm 3 MaLe 4 MaLe 5 FeMAle 6 FEMALE
Oh dear! This only fixes some values! It is difficult to notice values like "MaLe"
.
stringr
packageLike dplyr
, the stringr
package:
str_
data.frame
in dplyr
Useful String functions from base R and stringr
toupper()
, tolower()
- uppercase or lowercase your data
str_sentence()
- uppercase just the first character (in the stringr
package)
paste()
- paste strings together with a space
paste0
- paste strings together with no space as default
str_trim()
(in the stringr
package) or trimws
in base
nchar
- get the number of characters in a string
str_to_sentence()
#case_when way: data_gen <-data_gen %>% mutate(gender = str_to_sentence(gender)) %>% mutate(gender = case_when(gender %in% c("Male", "M", "m", "Man") ~ "Male", TRUE ~ gender)) head(data_gen)
# A tibble: 6 × 1 gender <chr> 1 F 2 Fm 3 Male 4 Male 5 Female 6 Female
OK, now we are getting somewhere!
Now we have a chance to keep but clean these values!
ufo <-read_csv("https://sisbid.github.io/Data-Wrangling/data/ufo/ufo_data_complete.csv", col_types = cols(`duration (seconds)` = "c"))
Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.: dat <- vroom(...) problems(dat)
p <- problems(ufo) ufo_clean <- ufo %>% slice((pull(p, row))*-1)
clean_names()
function from the janitor
packagecolnames(ufo_clean)
[1] "datetime" "city" "state" [4] "country" "shape" "duration (seconds)" [7] "duration (hours/min)" "comments" "date posted" [10] "latitude" "longitude"
ufo_clean <- clean_names(ufo_clean) colnames(ufo_clean)
[1] "datetime" "city" "state" [4] "country" "shape" "duration_seconds" [7] "duration_hours_min" "comments" "date_posted" [10] "latitude" "longitude"
Now let’s fix our ufo data and remove those pesky backticks in the duration_seconds
variable. First let’s find them with str_detect
.
ufo_clean %>% filter(str_detect( string = duration_seconds, pattern = "`"))
# A tibble: 3 × 11 datetime city state country shape durat…¹ durat…² comme…³ date_…⁴ latit…⁵ <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 2/2/2000 19… bouse az us <NA> 2` each a… Drivin… 2/16/2… 33.932… 2 4/10/2005 2… sant… ca us <NA> 8` eight … 2 red … 4/16/2… 36.974… 3 7/21/2006 1… ibag… <NA> <NA> circ… 0.5` 1/2 se… Viajab… 10/30/… 4.4406… # … with 1 more variable: longitude <chr>, and abbreviated variable names # ¹duration_seconds, ²duration_hours_min, ³comments, ⁴date_posted, ⁵latitude
ufo_clean <- ufo_clean %>% mutate(duration_seconds = str_remove(string = duration_seconds, pattern = "`"))
ufo_clean <- ufo_clean %>% mutate(duration_seconds = as.numeric(duration_seconds)) glimpse(ufo_clean)
Rows: 88,679 Columns: 11 $ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/1955… $ city <chr> "san marcos", "lackland afb", "chester (uk/england)… $ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al", "… $ country <chr> "us", NA, "gb", "us", "us", "us", "gb", "us", "us",… $ shape <chr> "cylinder", "light", "circle", "circle", "light", "… $ duration_seconds <dbl> 2700, 7200, 20, 20, 900, 300, 180, 1200, 180, 120, … $ duration_hours_min <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", … $ comments <chr> "This event took place in early fall around 1949-50… $ date_posted <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2004"… $ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333", "21… $ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.6458…
ufo_clean %>% mutate(duration_seconds = paste(duration_seconds, "sec", sep = " ")) %>% glimpse()
Rows: 88,679 Columns: 11 $ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/1955… $ city <chr> "san marcos", "lackland afb", "chester (uk/england)… $ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al", "… $ country <chr> "us", NA, "gb", "us", "us", "us", "gb", "us", "us",… $ shape <chr> "cylinder", "light", "circle", "circle", "light", "… $ duration_seconds <chr> "2700 sec", "7200 sec", "20 sec", "20 sec", "900 se… $ duration_hours_min <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", … $ comments <chr> "This event took place in early fall around 1949-50… $ date_posted <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2004"… $ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333", "21… $ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.6458…
stringr
str_sub(x, start, end)
- substrings from position start to position endExamples:
str_sub("I like friesian horses", 8,12)
[1] "fries"
#123456789101112 #I like fries str_sub(c("Site A", "Site B", "Site C"), 6,6)
[1] "A" "B" "C"
.
matches any single character*
means repeat as many (even if 0) more times the last character?
makes the last thing optional^
matches start of vector ^a
- starts with “a”$
matches end of vector b$
- ends with “b”stringr
str_detect
, str_subset
, str_replace
, and str_replace_all
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
str_detect
- returns TRUE
if pattern
is foundstr_subset
- returns only the strings which pattern were detectedstr_extract
- returns only the pattern which were detectedstr_replace
- replaces pattern
with replacement
the first timestr_replace_all
- replaces pattern
with replacement
as many times matchedThese are the indices where the pattern match occurs:
ufo_clean %>% filter(str_detect(comments, "two aliens")) %>% head()
# A tibble: 2 × 11 datetime city state country shape durat…¹ durat…² comme…³ date_…⁴ latit…⁵ <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> 1 10/14/2006 … yuma va us form… 300 5 minu… ((HOAX… 4/27/2… 36.615… 2 7/1/2007 23… nort… ct <NA> unkn… 60 1 minu… Witnes… 10/19/… 41.985… # … with 1 more variable: longitude <chr>, and abbreviated variable names # ¹duration_seconds, ²duration_hours_min, ³comments, ⁴date_posted, ⁵latitude
ufo_clean %>% filter(str_detect(comments, "two aliens")) %>% select(comments)
# A tibble: 2 × 1 comments <chr> 1 ((HOAX??)) two aliens appeared from a bright light to peacefully investigate… 2 Witnessed two aliens walking along baseball field fence.
str_subset()
is easierstr_subset()
gives the values that match the pattern:
ufo_clean %>% pull(comments) %>% str_subset( "two aliens")
[1] "((HOAX??)) two aliens appeared from a bright light to peacefully investigate the surroundings in the woods" [2] "Witnessed two aliens walking along baseball field fence."
str_extract
str_extract
extracts just the matched string
ufo_clean %>% mutate(aliens = str_extract(comments, "aliens")) %>% count(aliens)
# A tibble: 2 × 2 aliens n <chr> <int> 1 aliens 53 2 <NA> 88626
ufo_clean %>% pull(comments) %>%str_subset( "^aliens")
[1] "aliens speak german???" "aliens exist" "aliens in srilanka"
That contains space then ship maybe with stuff in between
ufo_clean %>% pull(comments) %>% str_subset("space.?ship") %>% head(4) # gets "spaceship" or "space ship" or...
[1] "I saw the cylinder shaped looked like a spaceship hovring above the east side of the Air Force base. Saw it for about 30 seconds and ra" [2] "description of a spaceship spotted over Birmingham Alabama in 1967." [3] "A space ship was descending to the ground" [4] "On Monday october 3, 2005, I spotted two spaceships in the sky. The first spotted ship was what seemed to be a bright star, when it st"
ufo_clean %>% pull(comments) %>% str_subset("space.ship") %>% head(4) # no "spaceship" must have character in bw
[1] "A space ship was descending to the ground" [2] "I saw a Silver space ship rising into the early morning sky over Houston, Texas." [3] "Saw a space ship hanging over the southern (Manzano) portion of the Sandia Mountains on evening. It was brightly lit, but not entirely." [4] "saw space ship for 5 min! Got scared crapless!!!!!!!!!!!!!!!! happened in arizona"
Let’s say we wanted to make the time information more consistent. Using case_when()
would be very tedious and error-prone!
We can use str_replace()
to do so.
ufo_clean %>% mutate(duration_hours_min = str_replace(string = duration_hours_min, pattern = "minutes", replacement ="mins")) %>% pull(duration_hours_min) %>% head(8)
[1] "45 mins" "1-2 hrs" "20 seconds" "1/2 hour" "15 mins" [6] "5 mins" "about 3 mins" "20 mins"
Better yet, you might notice that this data isn’t tidy- there are more than two entries for each value - amount of time and unit. We could separate this using separate()
from the tidyr
package.
ufo_clean %>% separate(duration_hours_min, into = c("duration_amount", "duration_unit"), sep = " ") %>% select(duration_amount, duration_unit) %>% head()
# A tibble: 6 × 2 duration_amount duration_unit <chr> <chr> 1 45 minutes 2 1-2 hrs 3 20 seconds 4 1/2 hour 5 15 minutes 6 5 minutes
As you can see there is still plenty of cleaning to do!
ufo_clean %>% separate(date_posted, into = c("date_posted", "time_posted"), sep = " ") %>% select(date_posted, time_posted) %>% head()
# A tibble: 6 × 2 date_posted time_posted <chr> <chr> 1 4/27/2004 <NA> 2 12/16/2005 <NA> 3 1/21/2008 <NA> 4 1/17/2004 <NA> 5 1/22/2004 <NA> 6 4/27/2007 <NA>
The [lubridate
](https://lubridate.tidyverse.org/) package is amazing for dates. Most important functions are those that look like ymd or mdy etc. They specify how a date should be interpreted.
library(lubridate)#need to load this one! ufo_clean <- ufo_clean %>% mutate(date_posted = mdy(date_posted))
Warning: 193 failed to parse.
head(ufo_clean)
# A tibble: 6 × 11 datetime city state country shape durat…¹ durat…² comme…³ date_pos…⁴ latit…⁵ <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <date> <chr> 1 10/10/19… san … tx us cyli… 2700 45 min… This e… 2004-04-27 29.883… 2 10/10/19… lack… tx <NA> light 7200 1-2 hrs 1949 L… 2005-12-16 29.384… 3 10/10/19… ches… <NA> gb circ… 20 20 sec… Green/… 2008-01-21 53.2 4 10/10/19… edna tx us circ… 20 1/2 ho… My old… 2004-01-17 28.978… 5 10/10/19… kane… hi us light 900 15 min… AS a M… 2004-01-22 21.418… 6 10/10/19… bris… tn us sphe… 300 5 minu… My fat… 2007-04-27 36.595… # … with 1 more variable: longitude <chr>, and abbreviated variable names # ¹duration_seconds, ²duration_hours_min, ³comments, ⁴date_posted, ⁵latitude
str_detect(string = c("abcdd", "two"), pattern = "dd")
[1] TRUE FALSE
str_subset(string = c("abcdd", "two"), pattern = "dd")
[1] "abcdd"
str_extract(string = c("abcdd", "two"), pattern = "dd")
[1] "dd" NA
str_sub(string = c("abcdd", "two"), start = 1, end = 3)
[1] "abc" "two"
stringr
package has lots of helpful functions that work on vectors or variables in a data framestr_detect
helps find patternsstr_detect
and filter
can help you filter data based on patterns within valuestr_extract
helps extract a patternstr_sub
extracts pieces of strings based on the position of the the charactersstr_subset
gives the values that match a patternseparate
can separate columns into two^
indicates the start of a string$
indicates the end of a stringlubridate
package is useful for dates and timeshttps://sisbid.github.io/Data-Wrangling/labs/data-cleaning-lab-part2.Rmd