One of the most important aspects of data cleaning is missing values.
Types of “missing” data:
NA
- general missing dataNaN
- stands for “Not a Number”, happens when you do 0/0.Inf
and -Inf
- Infinity, happens when you take a positive number (or negative number) by 0.Logical operations return NA
for NA
values. Think about it, the data could be > 2
or not we don’t know, so R
says there is no TRUE
or FALSE
, so that is missing:
x = c(0, NA, 2, 3, 4) x > 2
[1] FALSE NA FALSE TRUE TRUE
Similarly with logicals, operations/arithmetic with NA
will result in NA
s:
x + 2
[1] 2 NA 4 5 6
x * 2
[1] 0 NA 4 6 8
Recall that mathematical operations with NA
often result in NA
s.
sum(c(1,2,3,NA))
[1] NA
mean(c(1,2,3,NA))
[1] NA
median(c(1,2,3,NA))
[1] NA
Also true when we combine mathematical operations and logicals. Recall that TRUE
is evaluated as 1 and FALSE
is evaluated as 0.
x <- c(TRUE, TRUE, TRUE, TRUE, FALSE, NA) sum(x)
[1] NA
sum(x, na.rm = TRUE)
[1] 4
is.na
- looks for NAN
and NA
is.nan
- looks for NAN
is.infinite
- looks for Inf or -Inftest <- c(0,NA, -1) test/0
[1] NaN NA -Inf
test <- test/0 is.na(test)
[1] TRUE TRUE FALSE
is.nan(test)
[1] TRUE FALSE FALSE
is.infinite(test)
[1] FALSE FALSE TRUE
Do we have any NA
s? (any
can help)
A <- c(1, 2, 3, NA) B <- c(1, 2, 3, 4) any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE
Are all the values NA
? (all
can help)
A <- c(1, 2, 3, NA) B <- c(1, 2, 3, 4) all(is.na(A)) # are there any NAs - YES/TRUE
[1] FALSE
all(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE
NA
values with count()
Check the values for your variables, are they what you expect?
count()
is a great option because it gives you:
Check if rare values make sense.
library(readr) bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv") count(bike, subType)
# A tibble: 4 × 2 subType n <chr> <int> 1 STCLN 1 2 STRALY 3 3 STRPRD 1623 4 <NA> 4
naniar
Sometimes you need to look at lots of data though… the naniar
package is a good option.
#install.packages("naniar") library(naniar)
The airquality
dataset comes with R about air quality in New York in 1973.
?airquality # use this to find out more about the data
pct_complete()
This can tell you if there are missing values in the dataset.
pct_complete(airquality)
[1] 95.20697
Or for a particular variable:
airquality %>% select(Ozone) %>% pct_complete()
[1] 75.81699
miss_var_summary()
To get the percent missing (and counts) for each variable as a table, use this function.
miss_var_summary(airquality)
# A tibble: 6 × 3 variable n_miss pct_miss <chr> <int> <dbl> 1 Ozone 37 24.2 2 Solar.R 7 4.58 3 Wind 0 0 4 Temp 0 0 5 Month 0 0 6 Day 0 0
miss_case_summary
which rows have missing data in ordermiss_case_summary(airquality)
# A tibble: 153 × 3 case n_miss pct_miss <int> <int> <dbl> 1 5 2 33.3 2 27 2 33.3 3 6 1 16.7 4 10 1 16.7 5 11 1 16.7 6 25 1 16.7 7 26 1 16.7 8 32 1 16.7 9 33 1 16.7 10 34 1 16.7 # ℹ 143 more rows
naniar
plotsThe gg_miss_var()
function creates a nice plot about the number of missing values for each variable, (need a data frame).
gg_miss_var(airquality)
Be careful with missing data using subsetting!
filter()
removes missing values by default. Because R can’t tell for sure if an NA
value meets the condition. To keep them need to add is.na()
conditional.
Think about if this is OK or not - it depends on your data!
What if NA values represent values that are so low it is undetectable?
Filter will drop them from the data.
airquality %>% filter(Ozone < 5)
Ozone Solar.R Wind Temp Month Day 1 1 8 9.7 59 5 21 2 4 25 9.7 61 5 23
is.na()
can help us keep them.
airquality %>% filter(Ozone < 5 | is.na(Ozone))
Ozone Solar.R Wind Temp Month Day 1 NA NA 14.3 56 5 5 2 NA 194 8.6 69 5 10 3 1 8 9.7 59 5 21 4 4 25 9.7 61 5 23 5 NA 66 16.6 57 5 25 6 NA 266 14.9 58 5 26 7 NA NA 8.0 57 5 27 8 NA 286 8.6 78 6 1 9 NA 287 9.7 74 6 2 10 NA 242 16.1 67 6 3 11 NA 186 9.2 84 6 4 12 NA 220 8.6 85 6 5 13 NA 264 14.3 79 6 6 14 NA 273 6.9 87 6 8 15 NA 259 10.9 93 6 11 16 NA 250 9.2 92 6 12 17 NA 332 13.8 80 6 14 18 NA 322 11.5 79 6 15 19 NA 150 6.3 77 6 21 20 NA 59 1.7 76 6 22 21 NA 91 4.6 76 6 23 22 NA 250 6.3 76 6 24 23 NA 135 8.0 75 6 25 24 NA 127 8.0 78 6 26 25 NA 47 10.3 73 6 27 26 NA 98 11.5 80 6 28 27 NA 31 14.9 77 6 29 28 NA 138 8.0 83 6 30 29 NA 101 10.9 84 7 4 30 NA 139 8.6 82 7 11 31 NA 291 14.9 91 7 14 32 NA 258 9.7 81 7 22 33 NA 295 11.5 82 7 23 34 NA 222 8.6 92 8 10 35 NA 137 11.5 86 8 11 36 NA 64 11.5 79 8 15 37 NA 255 12.6 75 8 23 38 NA 153 5.7 88 8 27 39 NA 145 13.2 77 9 27
NA
values for a variable use drop_na()
A function from the tidyr
package. (Need a data frame to start!)
Disclaimer: Don’t do this unless you have thought about if dropping NA
values makes sense based on knowing what these values mean in your data.
dim(airquality)
[1] 153 6
airquality %>% drop_na(Ozone)
Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 28 NA 14.9 66 5 6 6 23 299 8.6 65 5 7 7 19 99 13.8 59 5 8 8 8 19 20.1 61 5 9 9 7 NA 6.9 74 5 11 10 16 256 9.7 69 5 12 11 11 290 9.2 66 5 13 12 14 274 10.9 68 5 14 13 18 65 13.2 58 5 15 14 14 334 11.5 64 5 16 15 34 307 12.0 66 5 17 16 6 78 18.4 57 5 18 17 30 322 11.5 68 5 19 18 11 44 9.7 62 5 20 19 1 8 9.7 59 5 21 20 11 320 16.6 73 5 22 21 4 25 9.7 61 5 23 22 32 92 12.0 61 5 24 23 23 13 12.0 67 5 28 24 45 252 14.9 81 5 29 25 115 223 5.7 79 5 30 26 37 279 7.4 76 5 31 27 29 127 9.7 82 6 7 28 71 291 13.8 90 6 9 29 39 323 11.5 87 6 10 30 23 148 8.0 82 6 13 31 21 191 14.9 77 6 16 32 37 284 20.7 72 6 17 33 20 37 9.2 65 6 18 34 12 120 11.5 73 6 19 35 13 137 10.3 76 6 20 36 135 269 4.1 84 7 1 37 49 248 9.2 85 7 2 38 32 236 9.2 81 7 3 39 64 175 4.6 83 7 5 40 40 314 10.9 83 7 6 41 77 276 5.1 88 7 7 42 97 267 6.3 92 7 8 43 97 272 5.7 92 7 9 44 85 175 7.4 89 7 10 45 10 264 14.3 73 7 12 46 27 175 14.9 81 7 13 47 7 48 14.3 80 7 15 48 48 260 6.9 81 7 16 49 35 274 10.3 82 7 17 50 61 285 6.3 84 7 18 51 79 187 5.1 87 7 19 52 63 220 11.5 85 7 20 53 16 7 6.9 74 7 21 54 80 294 8.6 86 7 24 55 108 223 8.0 85 7 25 56 20 81 8.6 82 7 26 57 52 82 12.0 86 7 27 58 82 213 7.4 88 7 28 59 50 275 7.4 86 7 29 60 64 253 7.4 83 7 30 61 59 254 9.2 81 7 31 62 39 83 6.9 81 8 1 63 9 24 13.8 81 8 2 64 16 77 7.4 82 8 3 65 78 NA 6.9 86 8 4 66 35 NA 7.4 85 8 5 67 66 NA 4.6 87 8 6 68 122 255 4.0 89 8 7 69 89 229 10.3 90 8 8 70 110 207 8.0 90 8 9 71 44 192 11.5 86 8 12 72 28 273 11.5 82 8 13 73 65 157 9.7 80 8 14 74 22 71 10.3 77 8 16 75 59 51 6.3 79 8 17 76 23 115 7.4 76 8 18 77 31 244 10.9 78 8 19 78 44 190 10.3 78 8 20 79 21 259 15.5 77 8 21 80 9 36 14.3 72 8 22 81 45 212 9.7 79 8 24 82 168 238 3.4 81 8 25 83 73 215 8.0 86 8 26 84 76 203 9.7 97 8 28 85 118 225 2.3 94 8 29 86 84 237 6.3 96 8 30 87 85 188 6.3 94 8 31 88 96 167 6.9 91 9 1 89 78 197 5.1 92 9 2 90 73 183 2.8 93 9 3 91 91 189 4.6 93 9 4 92 47 95 7.4 87 9 5 93 32 92 15.5 84 9 6 94 20 252 10.9 80 9 7 95 23 220 10.3 78 9 8 96 21 230 10.9 75 9 9 97 24 259 9.7 73 9 10 98 44 236 14.9 81 9 11 99 21 259 15.5 76 9 12 100 28 238 6.3 77 9 13 101 9 24 10.9 71 9 14 102 13 112 11.5 71 9 15 103 46 237 6.9 78 9 16 104 18 224 13.8 67 9 17 105 13 27 10.3 76 9 18 106 24 238 10.3 68 9 19 107 16 201 8.0 82 9 20 108 13 238 12.6 64 9 21 109 23 14 9.2 71 9 22 110 36 139 10.3 81 9 23 111 7 49 10.3 69 9 24 112 14 20 16.6 63 9 25 113 30 193 6.9 70 9 26 114 14 191 14.3 75 9 28 115 18 131 8.0 76 9 29 116 20 223 11.5 68 9 30
NA
values for a data frame use drop_na()
This function of the tidyr
package drops rows with any missing data in any column when used on a df.
airquality %>% drop_na()
Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 23 299 8.6 65 5 7 6 19 99 13.8 59 5 8 7 8 19 20.1 61 5 9 8 16 256 9.7 69 5 12 9 11 290 9.2 66 5 13 10 14 274 10.9 68 5 14 11 18 65 13.2 58 5 15 12 14 334 11.5 64 5 16 13 34 307 12.0 66 5 17 14 6 78 18.4 57 5 18 15 30 322 11.5 68 5 19 16 11 44 9.7 62 5 20 17 1 8 9.7 59 5 21 18 11 320 16.6 73 5 22 19 4 25 9.7 61 5 23 20 32 92 12.0 61 5 24 21 23 13 12.0 67 5 28 22 45 252 14.9 81 5 29 23 115 223 5.7 79 5 30 24 37 279 7.4 76 5 31 25 29 127 9.7 82 6 7 26 71 291 13.8 90 6 9 27 39 323 11.5 87 6 10 28 23 148 8.0 82 6 13 29 21 191 14.9 77 6 16 30 37 284 20.7 72 6 17 31 20 37 9.2 65 6 18 32 12 120 11.5 73 6 19 33 13 137 10.3 76 6 20 34 135 269 4.1 84 7 1 35 49 248 9.2 85 7 2 36 32 236 9.2 81 7 3 37 64 175 4.6 83 7 5 38 40 314 10.9 83 7 6 39 77 276 5.1 88 7 7 40 97 267 6.3 92 7 8 41 97 272 5.7 92 7 9 42 85 175 7.4 89 7 10 43 10 264 14.3 73 7 12 44 27 175 14.9 81 7 13 45 7 48 14.3 80 7 15 46 48 260 6.9 81 7 16 47 35 274 10.3 82 7 17 48 61 285 6.3 84 7 18 49 79 187 5.1 87 7 19 50 63 220 11.5 85 7 20 51 16 7 6.9 74 7 21 52 80 294 8.6 86 7 24 53 108 223 8.0 85 7 25 54 20 81 8.6 82 7 26 55 52 82 12.0 86 7 27 56 82 213 7.4 88 7 28 57 50 275 7.4 86 7 29 58 64 253 7.4 83 7 30 59 59 254 9.2 81 7 31 60 39 83 6.9 81 8 1 61 9 24 13.8 81 8 2 62 16 77 7.4 82 8 3 63 122 255 4.0 89 8 7 64 89 229 10.3 90 8 8 65 110 207 8.0 90 8 9 66 44 192 11.5 86 8 12 67 28 273 11.5 82 8 13 68 65 157 9.7 80 8 14 69 22 71 10.3 77 8 16 70 59 51 6.3 79 8 17 71 23 115 7.4 76 8 18 72 31 244 10.9 78 8 19 73 44 190 10.3 78 8 20 74 21 259 15.5 77 8 21 75 9 36 14.3 72 8 22 76 45 212 9.7 79 8 24 77 168 238 3.4 81 8 25 78 73 215 8.0 86 8 26 79 76 203 9.7 97 8 28 80 118 225 2.3 94 8 29 81 84 237 6.3 96 8 30 82 85 188 6.3 94 8 31 83 96 167 6.9 91 9 1 84 78 197 5.1 92 9 2 85 73 183 2.8 93 9 3 86 91 189 4.6 93 9 4 87 47 95 7.4 87 9 5 88 32 92 15.5 84 9 6 89 20 252 10.9 80 9 7 90 23 220 10.3 78 9 8 91 21 230 10.9 75 9 9 92 24 259 9.7 73 9 10 93 44 236 14.9 81 9 11 94 21 259 15.5 76 9 12 95 28 238 6.3 77 9 13 96 9 24 10.9 71 9 14 97 13 112 11.5 71 9 15 98 46 237 6.9 78 9 16 99 18 224 13.8 67 9 17 100 13 27 10.3 76 9 18 101 24 238 10.3 68 9 19 102 16 201 8.0 82 9 20 103 13 238 12.6 64 9 21 104 23 14 9.2 71 9 22 105 36 139 10.3 81 9 23 106 7 49 10.3 69 9 24 107 14 20 16.6 63 9 25 108 30 193 6.9 70 9 26 109 14 191 14.3 75 9 28 110 18 131 8.0 76 9 29 111 20 223 11.5 68 9 30
Use the miss_var_which()
function from naniar
miss_var_which(airquality) # which columns have missing values
[1] "Ozone" "Solar.R"
miss_var_which
and function from naniar
(need a data frame)
airquality %>% select(!miss_var_which(airquality))
Wind Temp Month Day 1 7.4 67 5 1 2 8.0 72 5 2 3 12.6 74 5 3 4 11.5 62 5 4 5 14.3 56 5 5 6 14.9 66 5 6 7 8.6 65 5 7 8 13.8 59 5 8 9 20.1 61 5 9 10 8.6 69 5 10 11 6.9 74 5 11 12 9.7 69 5 12 13 9.2 66 5 13 14 10.9 68 5 14 15 13.2 58 5 15 16 11.5 64 5 16 17 12.0 66 5 17 18 18.4 57 5 18 19 11.5 68 5 19 20 9.7 62 5 20 21 9.7 59 5 21 22 16.6 73 5 22 23 9.7 61 5 23 24 12.0 61 5 24 25 16.6 57 5 25 26 14.9 58 5 26 27 8.0 57 5 27 28 12.0 67 5 28 29 14.9 81 5 29 30 5.7 79 5 30 31 7.4 76 5 31 32 8.6 78 6 1 33 9.7 74 6 2 34 16.1 67 6 3 35 9.2 84 6 4 36 8.6 85 6 5 37 14.3 79 6 6 38 9.7 82 6 7 39 6.9 87 6 8 40 13.8 90 6 9 41 11.5 87 6 10 42 10.9 93 6 11 43 9.2 92 6 12 44 8.0 82 6 13 45 13.8 80 6 14 46 11.5 79 6 15 47 14.9 77 6 16 48 20.7 72 6 17 49 9.2 65 6 18 50 11.5 73 6 19 51 10.3 76 6 20 52 6.3 77 6 21 53 1.7 76 6 22 54 4.6 76 6 23 55 6.3 76 6 24 56 8.0 75 6 25 57 8.0 78 6 26 58 10.3 73 6 27 59 11.5 80 6 28 60 14.9 77 6 29 61 8.0 83 6 30 62 4.1 84 7 1 63 9.2 85 7 2 64 9.2 81 7 3 65 10.9 84 7 4 66 4.6 83 7 5 67 10.9 83 7 6 68 5.1 88 7 7 69 6.3 92 7 8 70 5.7 92 7 9 71 7.4 89 7 10 72 8.6 82 7 11 73 14.3 73 7 12 74 14.9 81 7 13 75 14.9 91 7 14 76 14.3 80 7 15 77 6.9 81 7 16 78 10.3 82 7 17 79 6.3 84 7 18 80 5.1 87 7 19 81 11.5 85 7 20 82 6.9 74 7 21 83 9.7 81 7 22 84 11.5 82 7 23 85 8.6 86 7 24 86 8.0 85 7 25 87 8.6 82 7 26 88 12.0 86 7 27 89 7.4 88 7 28 90 7.4 86 7 29 91 7.4 83 7 30 92 9.2 81 7 31 93 6.9 81 8 1 94 13.8 81 8 2 95 7.4 82 8 3 96 6.9 86 8 4 97 7.4 85 8 5 98 4.6 87 8 6 99 4.0 89 8 7 100 10.3 90 8 8 101 8.0 90 8 9 102 8.6 92 8 10 103 11.5 86 8 11 104 11.5 86 8 12 105 11.5 82 8 13 106 9.7 80 8 14 107 11.5 79 8 15 108 10.3 77 8 16 109 6.3 79 8 17 110 7.4 76 8 18 111 10.9 78 8 19 112 10.3 78 8 20 113 15.5 77 8 21 114 14.3 72 8 22 115 12.6 75 8 23 116 9.7 79 8 24 117 3.4 81 8 25 118 8.0 86 8 26 119 5.7 88 8 27 120 9.7 97 8 28 121 2.3 94 8 29 122 6.3 96 8 30 123 6.3 94 8 31 124 6.9 91 9 1 125 5.1 92 9 2 126 2.8 93 9 3 127 4.6 93 9 4 128 7.4 87 9 5 129 15.5 84 9 6 130 10.9 80 9 7 131 10.3 78 9 8 132 10.9 75 9 9 133 9.7 73 9 10 134 14.9 81 9 11 135 15.5 76 9 12 136 6.3 77 9 13 137 10.9 71 9 14 138 11.5 71 9 15 139 6.9 78 9 16 140 13.8 67 9 17 141 10.3 76 9 18 142 10.3 68 9 19 143 8.0 82 9 20 144 12.6 64 9 21 145 9.2 71 9 22 146 10.3 81 9 23 147 10.3 69 9 24 148 16.6 63 9 25 149 6.9 70 9 26 150 13.2 77 9 27 151 14.3 75 9 28 152 8.0 76 9 29 153 11.5 68 9 30
NA
Let’s say we think that all 0 values should be NA
.
library(readr) bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv") count(bike, dateInstalled)
# A tibble: 9 × 2 dateInstalled n <dbl> <int> 1 0 126 2 2006 2 3 2007 368 4 2008 206 5 2009 86 6 2010 625 7 2011 101 8 2012 107 9 2013 10
NA
The na_if()
function of dplyr
can be helpful for changing all 0 values to NA
.
bike <- bike %>% mutate(dateInstalled = na_if(dateInstalled, 0)) count(bike, dateInstalled)
# A tibble: 9 × 2 dateInstalled n <dbl> <int> 1 2006 2 2 2007 368 3 2008 206 4 2009 86 5 2010 625 6 2011 101 7 2012 107 8 2013 10 9 NA 126
NA
to be a valueThe replace_na()
function (part of the tidyr
package), can do the opposite of na_if()
. (note that you must use numeric values as replacement - we will show how to replace with character strings soon)
bike %>% mutate(dateInstalled = replace_na(dateInstalled, 2005)) %>% count(dateInstalled)
# A tibble: 9 × 2 dateInstalled n <dbl> <int> 1 2005 126 2 2006 2 3 2007 368 4 2008 206 5 2009 86 6 2010 625 7 2011 101 8 2012 107 9 2013 10
NA
THINK ABOUT YOUR DATA FIRST!
⚠️ Sometimes removing NA
values leads to distorted math - be careful!
⚠️ Think about what your NA
means for your data (are you sure ?).
Is an NA
for values so low they could not be reported?
Or is it if it was too low and also if there was a different issue (like no one reported)?
NA
If it is something more like a zero then you might want it included in your data like a zero instead of an NA
.
Example: - survey reports NA
if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week
⚠️ You might want to keep the NA
values so that you know the original sample size.
⚠️ Calculating percentages will give you a different result depending on your choice to include NA values.!
This is because the denominator changes.
NA
count(bike, dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 9 × 3 dateInstalled n percent <dbl> <int> <dbl> 1 2006 2 0.123 2 2007 368 22.6 3 2008 206 12.6 4 2009 86 5.27 5 2010 625 38.3 6 2011 101 6.19 7 2012 107 6.56 8 2013 10 0.613 9 NA 126 7.73
NA
bike %>% drop_na(dateInstalled) %>% count(dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 8 × 3 dateInstalled n percent <dbl> <int> <dbl> 1 2006 2 0.133 2 2007 368 24.5 3 2008 206 13.7 4 2009 86 5.71 5 2010 625 41.5 6 2011 101 6.71 7 2012 107 7.11 8 2013 10 0.664
Should you be dividing by the total count with NA
values included?
It depends on your data and what NA
might mean.
Pay attention to your data and your NA
values!
is.na()
,any(is.na())
, count()
, and functions from naniar
like gg_miss_var()
can help determine if we have NA
valuesfilter()
automatically removes NA
values - can’t confirm or deny if condition is met (need | is.na()
to keep them)drop_na()
can help you remove NA
values from a variable or an entire data frameNA
values can change your calculation resultsNA
values represent - don’t drop them if you shouldn’t