One of the most important aspects of data cleaning is missing values.
Types of “missing” data:
NA
- general missing dataNaN
- stands for “Not a Number”, happens when you do 0/0.Inf
and -Inf
- Infinity, happens when you take a positive number (or negative number) by 0.Logical operations return NA
for NA
values. Think about it, the data could be > 2
or not we don’t know, so R
says there is no TRUE
or FALSE
, so that is missing:
x <- c(0, NA, 2, 3, 4) x > 2
[1] FALSE NA FALSE TRUE TRUE
Mathematical operations with NA
result in NA
s.
y <- c(1,2,3,NA) sum(y)
[1] NA
mean(y)
[1] NA
Logicals: TRUE
is evaluated as 1 and FALSE
is evaluated as 0.
x <- c(TRUE, TRUE, TRUE, TRUE, FALSE, NA) sum(x)
[1] NA
sum(x, na.rm = TRUE)
[1] 4
is.na
- looks for NAN
and NA
is.nan
- looks for NAN
test <- c(0,NA, -1, NaN) is.na(test)
[1] FALSE TRUE FALSE TRUE
is.nan(test)
[1] FALSE FALSE FALSE TRUE
Do we have any NA
s? (any
can help)
A <- c(1, 2, 3, NA) B <- c(1, 2, 3, 4)
any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE
NA
values with count()
Check the values for your variables, are they what you expect?
count()
is a great option because it gives you:
Check if rare values make sense.
NA
values with count()
library(readr) bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv") count(bike, subType)
# A tibble: 4 × 2 subType n <chr> <int> 1 STCLN 1 2 STRALY 3 3 STRPRD 1623 4 <NA> 4
naniar
Sometimes you need to look at lots of data… the naniar
package is a good option.
#install.packages("naniar") library(naniar)
The airquality
dataset comes with R about air quality in New York in 1973.
?airquality # use this to find out more about the data
pct_complete()
This can tell you if there are missing values in the dataset.
pct_complete(airquality)
[1] 95.20697
Or for a particular variable:
airquality %>% select(Ozone) %>% pct_complete()
[1] 75.81699
miss_var_summary()
To get the percent missing (and counts) for each variable as a table, use this function.
miss_var_summary(airquality)
# A tibble: 6 × 3 variable n_miss pct_miss <chr> <int> <num> 1 Ozone 37 24.2 2 Solar.R 7 4.58 3 Wind 0 0 4 Temp 0 0 5 Month 0 0 6 Day 0 0
miss_case_summary
which rows (IDs) have missing datamiss_case_summary(airquality)
# A tibble: 153 × 3 case n_miss pct_miss <int> <int> <dbl> 1 5 2 33.3 2 27 2 33.3 3 6 1 16.7 4 10 1 16.7 5 11 1 16.7 6 25 1 16.7 7 26 1 16.7 8 32 1 16.7 9 33 1 16.7 10 34 1 16.7 # ℹ 143 more rows
naniar
plotsThe gg_miss_var()
function creates a nice plot about the number of missing values for each variable, (need a data frame).
gg_miss_var(airquality)
Be careful with missing data using subsetting!
filter()
removes missing values by default. Because R can’t tell for sure if an NA
value meets the condition. To keep them need to add is.na()
conditional.
Think about if this is OK or not - it depends on your data!
What if NA values represent values that are so low it is undetectable?
Filter will drop them from the data.
airquality %>% filter(Ozone < 5)
Ozone Solar.R Wind Temp Month Day 1 1 8 9.7 59 5 21 2 4 25 9.7 61 5 23
is.na()
can help us keep them.
airquality %>% filter(Ozone < 5 | is.na(Ozone))
Ozone Solar.R Wind Temp Month Day 1 NA NA 14.3 56 5 5 2 NA 194 8.6 69 5 10 3 1 8 9.7 59 5 21 4 4 25 9.7 61 5 23 5 NA 66 16.6 57 5 25 6 NA 266 14.9 58 5 26 7 NA NA 8.0 57 5 27 8 NA 286 8.6 78 6 1 9 NA 287 9.7 74 6 2 10 NA 242 16.1 67 6 3 11 NA 186 9.2 84 6 4 12 NA 220 8.6 85 6 5 13 NA 264 14.3 79 6 6 14 NA 273 6.9 87 6 8 15 NA 259 10.9 93 6 11 16 NA 250 9.2 92 6 12 17 NA 332 13.8 80 6 14 18 NA 322 11.5 79 6 15 19 NA 150 6.3 77 6 21 20 NA 59 1.7 76 6 22 21 NA 91 4.6 76 6 23 22 NA 250 6.3 76 6 24 23 NA 135 8.0 75 6 25 24 NA 127 8.0 78 6 26 25 NA 47 10.3 73 6 27 26 NA 98 11.5 80 6 28 27 NA 31 14.9 77 6 29 28 NA 138 8.0 83 6 30 29 NA 101 10.9 84 7 4 30 NA 139 8.6 82 7 11 31 NA 291 14.9 91 7 14 32 NA 258 9.7 81 7 22 33 NA 295 11.5 82 7 23 34 NA 222 8.6 92 8 10 35 NA 137 11.5 86 8 11 36 NA 64 11.5 79 8 15 37 NA 255 12.6 75 8 23 38 NA 153 5.7 88 8 27 39 NA 145 13.2 77 9 27
NA
values for a variable use drop_na()
A function from the tidyr
package. (Need a data frame to start!)
Disclaimer: Don’t do this unless you have thought about if dropping NA
values makes sense based on knowing what these values mean in your data.
airquality
Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10 11 7 NA 6.9 74 5 11 12 16 256 9.7 69 5 12 13 11 290 9.2 66 5 13 14 14 274 10.9 68 5 14 15 18 65 13.2 58 5 15 16 14 334 11.5 64 5 16 17 34 307 12.0 66 5 17 18 6 78 18.4 57 5 18 19 30 322 11.5 68 5 19 20 11 44 9.7 62 5 20 21 1 8 9.7 59 5 21 22 11 320 16.6 73 5 22 23 4 25 9.7 61 5 23 24 32 92 12.0 61 5 24 25 NA 66 16.6 57 5 25 26 NA 266 14.9 58 5 26 27 NA NA 8.0 57 5 27 28 23 13 12.0 67 5 28 29 45 252 14.9 81 5 29 30 115 223 5.7 79 5 30 31 37 279 7.4 76 5 31 32 NA 286 8.6 78 6 1 33 NA 287 9.7 74 6 2 34 NA 242 16.1 67 6 3 35 NA 186 9.2 84 6 4 36 NA 220 8.6 85 6 5 37 NA 264 14.3 79 6 6 38 29 127 9.7 82 6 7 39 NA 273 6.9 87 6 8 40 71 291 13.8 90 6 9 41 39 323 11.5 87 6 10 42 NA 259 10.9 93 6 11 43 NA 250 9.2 92 6 12 44 23 148 8.0 82 6 13 45 NA 332 13.8 80 6 14 46 NA 322 11.5 79 6 15 47 21 191 14.9 77 6 16 48 37 284 20.7 72 6 17 49 20 37 9.2 65 6 18 50 12 120 11.5 73 6 19 51 13 137 10.3 76 6 20 52 NA 150 6.3 77 6 21 53 NA 59 1.7 76 6 22 54 NA 91 4.6 76 6 23 55 NA 250 6.3 76 6 24 56 NA 135 8.0 75 6 25 57 NA 127 8.0 78 6 26 58 NA 47 10.3 73 6 27 59 NA 98 11.5 80 6 28 60 NA 31 14.9 77 6 29 61 NA 138 8.0 83 6 30 62 135 269 4.1 84 7 1 63 49 248 9.2 85 7 2 64 32 236 9.2 81 7 3 65 NA 101 10.9 84 7 4 66 64 175 4.6 83 7 5 67 40 314 10.9 83 7 6 68 77 276 5.1 88 7 7 69 97 267 6.3 92 7 8 70 97 272 5.7 92 7 9 71 85 175 7.4 89 7 10 72 NA 139 8.6 82 7 11 73 10 264 14.3 73 7 12 74 27 175 14.9 81 7 13 75 NA 291 14.9 91 7 14 76 7 48 14.3 80 7 15 77 48 260 6.9 81 7 16 78 35 274 10.3 82 7 17 79 61 285 6.3 84 7 18 80 79 187 5.1 87 7 19 81 63 220 11.5 85 7 20 82 16 7 6.9 74 7 21 83 NA 258 9.7 81 7 22 84 NA 295 11.5 82 7 23 85 80 294 8.6 86 7 24 86 108 223 8.0 85 7 25 87 20 81 8.6 82 7 26 88 52 82 12.0 86 7 27 89 82 213 7.4 88 7 28 90 50 275 7.4 86 7 29 91 64 253 7.4 83 7 30 92 59 254 9.2 81 7 31 93 39 83 6.9 81 8 1 94 9 24 13.8 81 8 2 95 16 77 7.4 82 8 3 96 78 NA 6.9 86 8 4 97 35 NA 7.4 85 8 5 98 66 NA 4.6 87 8 6 99 122 255 4.0 89 8 7 100 89 229 10.3 90 8 8 101 110 207 8.0 90 8 9 102 NA 222 8.6 92 8 10 103 NA 137 11.5 86 8 11 104 44 192 11.5 86 8 12 105 28 273 11.5 82 8 13 106 65 157 9.7 80 8 14 107 NA 64 11.5 79 8 15 108 22 71 10.3 77 8 16 109 59 51 6.3 79 8 17 110 23 115 7.4 76 8 18 111 31 244 10.9 78 8 19 112 44 190 10.3 78 8 20 113 21 259 15.5 77 8 21 114 9 36 14.3 72 8 22 115 NA 255 12.6 75 8 23 116 45 212 9.7 79 8 24 117 168 238 3.4 81 8 25 118 73 215 8.0 86 8 26 119 NA 153 5.7 88 8 27 120 76 203 9.7 97 8 28 121 118 225 2.3 94 8 29 122 84 237 6.3 96 8 30 123 85 188 6.3 94 8 31 124 96 167 6.9 91 9 1 125 78 197 5.1 92 9 2 126 73 183 2.8 93 9 3 127 91 189 4.6 93 9 4 128 47 95 7.4 87 9 5 129 32 92 15.5 84 9 6 130 20 252 10.9 80 9 7 131 23 220 10.3 78 9 8 132 21 230 10.9 75 9 9 133 24 259 9.7 73 9 10 134 44 236 14.9 81 9 11 135 21 259 15.5 76 9 12 136 28 238 6.3 77 9 13 137 9 24 10.9 71 9 14 138 13 112 11.5 71 9 15 139 46 237 6.9 78 9 16 140 18 224 13.8 67 9 17 141 13 27 10.3 76 9 18 142 24 238 10.3 68 9 19 143 16 201 8.0 82 9 20 144 13 238 12.6 64 9 21 145 23 14 9.2 71 9 22 146 36 139 10.3 81 9 23 147 7 49 10.3 69 9 24 148 14 20 16.6 63 9 25 149 30 193 6.9 70 9 26 150 NA 145 13.2 77 9 27 151 14 191 14.3 75 9 28 152 18 131 8.0 76 9 29 153 20 223 11.5 68 9 30
NA
values for a variable use drop_na()
airquality %>% drop_na(Ozone)
Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 28 NA 14.9 66 5 6 6 23 299 8.6 65 5 7 7 19 99 13.8 59 5 8 8 8 19 20.1 61 5 9 9 7 NA 6.9 74 5 11 10 16 256 9.7 69 5 12 11 11 290 9.2 66 5 13 12 14 274 10.9 68 5 14 13 18 65 13.2 58 5 15 14 14 334 11.5 64 5 16 15 34 307 12.0 66 5 17 16 6 78 18.4 57 5 18 17 30 322 11.5 68 5 19 18 11 44 9.7 62 5 20 19 1 8 9.7 59 5 21 20 11 320 16.6 73 5 22 21 4 25 9.7 61 5 23 22 32 92 12.0 61 5 24 23 23 13 12.0 67 5 28 24 45 252 14.9 81 5 29 25 115 223 5.7 79 5 30 26 37 279 7.4 76 5 31 27 29 127 9.7 82 6 7 28 71 291 13.8 90 6 9 29 39 323 11.5 87 6 10 30 23 148 8.0 82 6 13 31 21 191 14.9 77 6 16 32 37 284 20.7 72 6 17 33 20 37 9.2 65 6 18 34 12 120 11.5 73 6 19 35 13 137 10.3 76 6 20 36 135 269 4.1 84 7 1 37 49 248 9.2 85 7 2 38 32 236 9.2 81 7 3 39 64 175 4.6 83 7 5 40 40 314 10.9 83 7 6 41 77 276 5.1 88 7 7 42 97 267 6.3 92 7 8 43 97 272 5.7 92 7 9 44 85 175 7.4 89 7 10 45 10 264 14.3 73 7 12 46 27 175 14.9 81 7 13 47 7 48 14.3 80 7 15 48 48 260 6.9 81 7 16 49 35 274 10.3 82 7 17 50 61 285 6.3 84 7 18 51 79 187 5.1 87 7 19 52 63 220 11.5 85 7 20 53 16 7 6.9 74 7 21 54 80 294 8.6 86 7 24 55 108 223 8.0 85 7 25 56 20 81 8.6 82 7 26 57 52 82 12.0 86 7 27 58 82 213 7.4 88 7 28 59 50 275 7.4 86 7 29 60 64 253 7.4 83 7 30 61 59 254 9.2 81 7 31 62 39 83 6.9 81 8 1 63 9 24 13.8 81 8 2 64 16 77 7.4 82 8 3 65 78 NA 6.9 86 8 4 66 35 NA 7.4 85 8 5 67 66 NA 4.6 87 8 6 68 122 255 4.0 89 8 7 69 89 229 10.3 90 8 8 70 110 207 8.0 90 8 9 71 44 192 11.5 86 8 12 72 28 273 11.5 82 8 13 73 65 157 9.7 80 8 14 74 22 71 10.3 77 8 16 75 59 51 6.3 79 8 17 76 23 115 7.4 76 8 18 77 31 244 10.9 78 8 19 78 44 190 10.3 78 8 20 79 21 259 15.5 77 8 21 80 9 36 14.3 72 8 22 81 45 212 9.7 79 8 24 82 168 238 3.4 81 8 25 83 73 215 8.0 86 8 26 84 76 203 9.7 97 8 28 85 118 225 2.3 94 8 29 86 84 237 6.3 96 8 30 87 85 188 6.3 94 8 31 88 96 167 6.9 91 9 1 89 78 197 5.1 92 9 2 90 73 183 2.8 93 9 3 91 91 189 4.6 93 9 4 92 47 95 7.4 87 9 5 93 32 92 15.5 84 9 6 94 20 252 10.9 80 9 7 95 23 220 10.3 78 9 8 96 21 230 10.9 75 9 9 97 24 259 9.7 73 9 10 98 44 236 14.9 81 9 11 99 21 259 15.5 76 9 12 100 28 238 6.3 77 9 13 101 9 24 10.9 71 9 14 102 13 112 11.5 71 9 15 103 46 237 6.9 78 9 16 104 18 224 13.8 67 9 17 105 13 27 10.3 76 9 18 106 24 238 10.3 68 9 19 107 16 201 8.0 82 9 20 108 13 238 12.6 64 9 21 109 23 14 9.2 71 9 22 110 36 139 10.3 81 9 23 111 7 49 10.3 69 9 24 112 14 20 16.6 63 9 25 113 30 193 6.9 70 9 26 114 14 191 14.3 75 9 28 115 18 131 8.0 76 9 29 116 20 223 11.5 68 9 30
drop_na()
Drops rows with any missing data in any column.
airquality %>% drop_na()
Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 23 299 8.6 65 5 7 6 19 99 13.8 59 5 8 7 8 19 20.1 61 5 9 8 16 256 9.7 69 5 12 9 11 290 9.2 66 5 13 10 14 274 10.9 68 5 14 11 18 65 13.2 58 5 15 12 14 334 11.5 64 5 16 13 34 307 12.0 66 5 17 14 6 78 18.4 57 5 18 15 30 322 11.5 68 5 19 16 11 44 9.7 62 5 20 17 1 8 9.7 59 5 21 18 11 320 16.6 73 5 22 19 4 25 9.7 61 5 23 20 32 92 12.0 61 5 24 21 23 13 12.0 67 5 28 22 45 252 14.9 81 5 29 23 115 223 5.7 79 5 30 24 37 279 7.4 76 5 31 25 29 127 9.7 82 6 7 26 71 291 13.8 90 6 9 27 39 323 11.5 87 6 10 28 23 148 8.0 82 6 13 29 21 191 14.9 77 6 16 30 37 284 20.7 72 6 17 31 20 37 9.2 65 6 18 32 12 120 11.5 73 6 19 33 13 137 10.3 76 6 20 34 135 269 4.1 84 7 1 35 49 248 9.2 85 7 2 36 32 236 9.2 81 7 3 37 64 175 4.6 83 7 5 38 40 314 10.9 83 7 6 39 77 276 5.1 88 7 7 40 97 267 6.3 92 7 8 41 97 272 5.7 92 7 9 42 85 175 7.4 89 7 10 43 10 264 14.3 73 7 12 44 27 175 14.9 81 7 13 45 7 48 14.3 80 7 15 46 48 260 6.9 81 7 16 47 35 274 10.3 82 7 17 48 61 285 6.3 84 7 18 49 79 187 5.1 87 7 19 50 63 220 11.5 85 7 20 51 16 7 6.9 74 7 21 52 80 294 8.6 86 7 24 53 108 223 8.0 85 7 25 54 20 81 8.6 82 7 26 55 52 82 12.0 86 7 27 56 82 213 7.4 88 7 28 57 50 275 7.4 86 7 29 58 64 253 7.4 83 7 30 59 59 254 9.2 81 7 31 60 39 83 6.9 81 8 1 61 9 24 13.8 81 8 2 62 16 77 7.4 82 8 3 63 122 255 4.0 89 8 7 64 89 229 10.3 90 8 8 65 110 207 8.0 90 8 9 66 44 192 11.5 86 8 12 67 28 273 11.5 82 8 13 68 65 157 9.7 80 8 14 69 22 71 10.3 77 8 16 70 59 51 6.3 79 8 17 71 23 115 7.4 76 8 18 72 31 244 10.9 78 8 19 73 44 190 10.3 78 8 20 74 21 259 15.5 77 8 21 75 9 36 14.3 72 8 22 76 45 212 9.7 79 8 24 77 168 238 3.4 81 8 25 78 73 215 8.0 86 8 26 79 76 203 9.7 97 8 28 80 118 225 2.3 94 8 29 81 84 237 6.3 96 8 30 82 85 188 6.3 94 8 31 83 96 167 6.9 91 9 1 84 78 197 5.1 92 9 2 85 73 183 2.8 93 9 3 86 91 189 4.6 93 9 4 87 47 95 7.4 87 9 5 88 32 92 15.5 84 9 6 89 20 252 10.9 80 9 7 90 23 220 10.3 78 9 8 91 21 230 10.9 75 9 9 92 24 259 9.7 73 9 10 93 44 236 14.9 81 9 11 94 21 259 15.5 76 9 12 95 28 238 6.3 77 9 13 96 9 24 10.9 71 9 14 97 13 112 11.5 71 9 15 98 46 237 6.9 78 9 16 99 18 224 13.8 67 9 17 100 13 27 10.3 76 9 18 101 24 238 10.3 68 9 19 102 16 201 8.0 82 9 20 103 13 238 12.6 64 9 21 104 23 14 9.2 71 9 22 105 36 139 10.3 81 9 23 106 7 49 10.3 69 9 24 107 14 20 16.6 63 9 25 108 30 193 6.9 70 9 26 109 14 191 14.3 75 9 28 110 18 131 8.0 76 9 29 111 20 223 11.5 68 9 30
Use the miss_var_which()
function from naniar
miss_var_which(airquality) # which columns have missing values
[1] "Ozone" "Solar.R"
miss_var_which
and function from naniar
(need a data frame)
airquality %>% select(!miss_var_which(airquality))
Wind Temp Month Day 1 7.4 67 5 1 2 8.0 72 5 2 3 12.6 74 5 3 4 11.5 62 5 4 5 14.3 56 5 5 6 14.9 66 5 6 7 8.6 65 5 7 8 13.8 59 5 8 9 20.1 61 5 9 10 8.6 69 5 10 11 6.9 74 5 11 12 9.7 69 5 12 13 9.2 66 5 13 14 10.9 68 5 14 15 13.2 58 5 15 16 11.5 64 5 16 17 12.0 66 5 17 18 18.4 57 5 18 19 11.5 68 5 19 20 9.7 62 5 20 21 9.7 59 5 21 22 16.6 73 5 22 23 9.7 61 5 23 24 12.0 61 5 24 25 16.6 57 5 25 26 14.9 58 5 26 27 8.0 57 5 27 28 12.0 67 5 28 29 14.9 81 5 29 30 5.7 79 5 30 31 7.4 76 5 31 32 8.6 78 6 1 33 9.7 74 6 2 34 16.1 67 6 3 35 9.2 84 6 4 36 8.6 85 6 5 37 14.3 79 6 6 38 9.7 82 6 7 39 6.9 87 6 8 40 13.8 90 6 9 41 11.5 87 6 10 42 10.9 93 6 11 43 9.2 92 6 12 44 8.0 82 6 13 45 13.8 80 6 14 46 11.5 79 6 15 47 14.9 77 6 16 48 20.7 72 6 17 49 9.2 65 6 18 50 11.5 73 6 19 51 10.3 76 6 20 52 6.3 77 6 21 53 1.7 76 6 22 54 4.6 76 6 23 55 6.3 76 6 24 56 8.0 75 6 25 57 8.0 78 6 26 58 10.3 73 6 27 59 11.5 80 6 28 60 14.9 77 6 29 61 8.0 83 6 30 62 4.1 84 7 1 63 9.2 85 7 2 64 9.2 81 7 3 65 10.9 84 7 4 66 4.6 83 7 5 67 10.9 83 7 6 68 5.1 88 7 7 69 6.3 92 7 8 70 5.7 92 7 9 71 7.4 89 7 10 72 8.6 82 7 11 73 14.3 73 7 12 74 14.9 81 7 13 75 14.9 91 7 14 76 14.3 80 7 15 77 6.9 81 7 16 78 10.3 82 7 17 79 6.3 84 7 18 80 5.1 87 7 19 81 11.5 85 7 20 82 6.9 74 7 21 83 9.7 81 7 22 84 11.5 82 7 23 85 8.6 86 7 24 86 8.0 85 7 25 87 8.6 82 7 26 88 12.0 86 7 27 89 7.4 88 7 28 90 7.4 86 7 29 91 7.4 83 7 30 92 9.2 81 7 31 93 6.9 81 8 1 94 13.8 81 8 2 95 7.4 82 8 3 96 6.9 86 8 4 97 7.4 85 8 5 98 4.6 87 8 6 99 4.0 89 8 7 100 10.3 90 8 8 101 8.0 90 8 9 102 8.6 92 8 10 103 11.5 86 8 11 104 11.5 86 8 12 105 11.5 82 8 13 106 9.7 80 8 14 107 11.5 79 8 15 108 10.3 77 8 16 109 6.3 79 8 17 110 7.4 76 8 18 111 10.9 78 8 19 112 10.3 78 8 20 113 15.5 77 8 21 114 14.3 72 8 22 115 12.6 75 8 23 116 9.7 79 8 24 117 3.4 81 8 25 118 8.0 86 8 26 119 5.7 88 8 27 120 9.7 97 8 28 121 2.3 94 8 29 122 6.3 96 8 30 123 6.3 94 8 31 124 6.9 91 9 1 125 5.1 92 9 2 126 2.8 93 9 3 127 4.6 93 9 4 128 7.4 87 9 5 129 15.5 84 9 6 130 10.9 80 9 7 131 10.3 78 9 8 132 10.9 75 9 9 133 9.7 73 9 10 134 14.9 81 9 11 135 15.5 76 9 12 136 6.3 77 9 13 137 10.9 71 9 14 138 11.5 71 9 15 139 6.9 78 9 16 140 13.8 67 9 17 141 10.3 76 9 18 142 10.3 68 9 19 143 8.0 82 9 20 144 12.6 64 9 21 145 9.2 71 9 22 146 10.3 81 9 23 147 10.3 69 9 24 148 16.6 63 9 25 149 6.9 70 9 26 150 13.2 77 9 27 151 14.3 75 9 28 152 8.0 76 9 29 153 11.5 68 9 30
NA
Let’s say we think that all 0 values should be NA
.
library(readr) bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv") bike <-bike %>% select(type, dateInstalled) bike
# A tibble: 1,631 × 2 type dateInstalled <chr> <dbl> 1 BIKE BOULEVARD 0 2 SIDEPATH 2010 3 SIGNED ROUTE 2010 4 SIDEPATH 0 5 BIKE LANE 2011 6 SIGNED ROUTE 2007 7 SIGNED ROUTE 2007 8 SIGNED ROUTE 2007 9 BIKE LANE 2009 10 SHARROW 2007 # ℹ 1,621 more rows
NA
The na_if()
function of dplyr
can be helpful for changing all 0 values to NA
.
More on mutate()
soon!
bike %>% mutate(dateInstalled = na_if(dateInstalled, 0))
# A tibble: 1,631 × 2 type dateInstalled <chr> <dbl> 1 BIKE BOULEVARD NA 2 SIDEPATH 2010 3 SIGNED ROUTE 2010 4 SIDEPATH NA 5 BIKE LANE 2011 6 SIGNED ROUTE 2007 7 SIGNED ROUTE 2007 8 SIGNED ROUTE 2007 9 BIKE LANE 2009 10 SHARROW 2007 # ℹ 1,621 more rows
NA
to be a valueThe replace_na()
function (part of the tidyr
package), can do the opposite of na_if()
.
Replacement much match the class of the other values in the column (e.g., character, numeric).
bike %>% mutate(dateInstalled = replace_na(dateInstalled, 2005))
# A tibble: 1,631 × 2 type dateInstalled <chr> <dbl> 1 BIKE BOULEVARD 0 2 SIDEPATH 2010 3 SIGNED ROUTE 2010 4 SIDEPATH 0 5 BIKE LANE 2011 6 SIGNED ROUTE 2007 7 SIGNED ROUTE 2007 8 SIGNED ROUTE 2007 9 BIKE LANE 2009 10 SHARROW 2007 # ℹ 1,621 more rows
NA
THINK ABOUT YOUR DATA FIRST!
⚠️ Sometimes removing NA
values leads to distorted math - be careful!
⚠️ Think about what your NA
means for your data (are you sure ?).
Is an NA
for values so low they could not be reported?
Or is it if it was too low and also if there was a different issue (like no one reported)?
NA
If it is something more like a zero then you might want it included in your data like a zero instead of an NA
.
Example: - survey reports NA
if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week
⚠️ You might want to keep the NA
values so that you know the original sample size.
⚠️ Calculating percentages will give you a different result depending on your choice to include NA values.!
This is because the denominator changes.
NA
count(bike, dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 9 × 3 dateInstalled n percent <dbl> <int> <dbl> 1 0 126 7.73 2 2006 2 0.123 3 2007 368 22.6 4 2008 206 12.6 5 2009 86 5.27 6 2010 625 38.3 7 2011 101 6.19 8 2012 107 6.56 9 2013 10 0.613
NA
bike %>% drop_na(dateInstalled) %>% count(dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 9 × 3 dateInstalled n percent <dbl> <int> <dbl> 1 0 126 7.73 2 2006 2 0.123 3 2007 368 22.6 4 2008 206 12.6 5 2009 86 5.27 6 2010 625 38.3 7 2011 101 6.19 8 2012 107 6.56 9 2013 10 0.613
Should you be dividing by the total count with NA
values included?
It depends on your data and what NA
might mean.
Pay attention to your data and your NA
values!
is.na()
,any(is.na())
, count()
, and functions from naniar
like gg_miss_var()
can help determine if we have NA
valuesfilter()
automatically removes NA
values - can’t confirm or deny if condition is met (need | is.na()
to keep them)drop_na()
can help you remove NA
values from a variable or an entire data frameNA
values can change your calculation resultsNA
values represent - don’t drop them if you shouldn’thttps://sisbid.github.io/Data-Wrangling/labs/missing-data-lab.Rmd