Dealing with Missing Data

Missing data types

One of the most important aspects of data cleaning is missing values.

Types of “missing” data:

  • NA - general missing data
  • NaN - stands for “Not a Number”, happens when you do 0/0.
  • Inf and -Inf - Infinity, happens when you take a positive number (or negative number) by 0.

Missing Data with Logicals

Logical operations return NA for NA values. Think about it, the data could be > 2 or not we don’t know, so R says there is no TRUE or FALSE, so that is missing:

x <- c(0, NA, 2, 3, 4)
x > 2
[1] FALSE    NA FALSE  TRUE  TRUE

Missing Data Issues

Mathematical operations with NA result in NAs.

y <- c(1,2,3,NA)
sum(y)
[1] NA
mean(y)
[1] NA

Missing Data Issues

Logicals: TRUE is evaluated as 1 and FALSE is evaluated as 0.

x <- c(TRUE, TRUE, TRUE, TRUE, FALSE, NA)
sum(x)
[1] NA
sum(x, na.rm = TRUE)
[1] 4

Finding Missing data

  • is.na - looks for NAN and NA
  • is.nan- looks for NAN
test <- c(0,NA, -1, NaN)
is.na(test)
[1] FALSE  TRUE FALSE  TRUE
is.nan(test)
[1] FALSE FALSE FALSE  TRUE

Useful checking functions

Do we have any NAs? (any can help)

A <- c(1, 2, 3, NA)
B <- c(1, 2, 3, 4)


any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE

Finding NA values with count()

Check the values for your variables, are they what you expect?

count() is a great option because it gives you:

  1. The unique values
  2. The amount of these values

Check if rare values make sense.

Finding NA values with count()

library(readr)
bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv")
count(bike, subType)
# A tibble: 4 × 2
  subType     n
  <chr>   <int>
1 STCLN       1
2 STRALY      3
3 STRPRD   1623
4 <NA>        4

naniar

Sometimes you need to look at lots of data… the naniar package is a good option.

#install.packages("naniar")
library(naniar)

Air quality data

The airquality dataset comes with R about air quality in New York in 1973.

?airquality # use this to find out more about the data

naniar: pct_complete()

This can tell you if there are missing values in the dataset.

pct_complete(airquality)
[1] 95.20697

Or for a particular variable:

airquality %>% select(Ozone) %>%
pct_complete()
[1] 75.81699

naniar:miss_var_summary()

To get the percent missing (and counts) for each variable as a table, use this function.

miss_var_summary(airquality)
# A tibble: 6 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 Ozone        37    24.2 
2 Solar.R       7     4.58
3 Wind          0     0   
4 Temp          0     0   
5 Month         0     0   
6 Day           0     0   

miss_case_summary which rows (IDs) have missing data

miss_case_summary(airquality)
# A tibble: 153 × 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ℹ 143 more rows

naniar plots

The gg_miss_var() function creates a nice plot about the number of missing values for each variable, (need a data frame).

gg_miss_var(airquality)

filter() and missing data

Be careful with missing data using subsetting!

filter() removes missing values by default. Because R can’t tell for sure if an NA value meets the condition. To keep them need to add is.na() conditional.

Think about if this is OK or not - it depends on your data!

filter() and missing data

What if NA values represent values that are so low it is undetectable?

Filter will drop them from the data.

airquality %>% filter(Ozone < 5)
  Ozone Solar.R Wind Temp Month Day
1     1       8  9.7   59     5  21
2     4      25  9.7   61     5  23

filter() and missing data

is.na() can help us keep them.

airquality %>% filter(Ozone < 5 | is.na(Ozone))
   Ozone Solar.R Wind Temp Month Day
1     NA      NA 14.3   56     5   5
2     NA     194  8.6   69     5  10
3      1       8  9.7   59     5  21
4      4      25  9.7   61     5  23
5     NA      66 16.6   57     5  25
6     NA     266 14.9   58     5  26
7     NA      NA  8.0   57     5  27
8     NA     286  8.6   78     6   1
9     NA     287  9.7   74     6   2
10    NA     242 16.1   67     6   3
11    NA     186  9.2   84     6   4
12    NA     220  8.6   85     6   5
13    NA     264 14.3   79     6   6
14    NA     273  6.9   87     6   8
15    NA     259 10.9   93     6  11
16    NA     250  9.2   92     6  12
17    NA     332 13.8   80     6  14
18    NA     322 11.5   79     6  15
19    NA     150  6.3   77     6  21
20    NA      59  1.7   76     6  22
21    NA      91  4.6   76     6  23
22    NA     250  6.3   76     6  24
23    NA     135  8.0   75     6  25
24    NA     127  8.0   78     6  26
25    NA      47 10.3   73     6  27
26    NA      98 11.5   80     6  28
27    NA      31 14.9   77     6  29
28    NA     138  8.0   83     6  30
29    NA     101 10.9   84     7   4
30    NA     139  8.6   82     7  11
31    NA     291 14.9   91     7  14
32    NA     258  9.7   81     7  22
33    NA     295 11.5   82     7  23
34    NA     222  8.6   92     8  10
35    NA     137 11.5   86     8  11
36    NA      64 11.5   79     8  15
37    NA     255 12.6   75     8  23
38    NA     153  5.7   88     8  27
39    NA     145 13.2   77     9  27

To remove rows with NA values for a variable use drop_na()

A function from the tidyr package. (Need a data frame to start!)

Disclaimer: Don’t do this unless you have thought about if dropping NA values makes sense based on knowing what these values mean in your data.

airquality
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
15     18      65 13.2   58     5  15
16     14     334 11.5   64     5  16
17     34     307 12.0   66     5  17
18      6      78 18.4   57     5  18
19     30     322 11.5   68     5  19
20     11      44  9.7   62     5  20
21      1       8  9.7   59     5  21
22     11     320 16.6   73     5  22
23      4      25  9.7   61     5  23
24     32      92 12.0   61     5  24
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26
27     NA      NA  8.0   57     5  27
28     23      13 12.0   67     5  28
29     45     252 14.9   81     5  29
30    115     223  5.7   79     5  30
31     37     279  7.4   76     5  31
32     NA     286  8.6   78     6   1
33     NA     287  9.7   74     6   2
34     NA     242 16.1   67     6   3
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
37     NA     264 14.3   79     6   6
38     29     127  9.7   82     6   7
39     NA     273  6.9   87     6   8
40     71     291 13.8   90     6   9
41     39     323 11.5   87     6  10
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
44     23     148  8.0   82     6  13
45     NA     332 13.8   80     6  14
46     NA     322 11.5   79     6  15
47     21     191 14.9   77     6  16
48     37     284 20.7   72     6  17
49     20      37  9.2   65     6  18
50     12     120 11.5   73     6  19
51     13     137 10.3   76     6  20
52     NA     150  6.3   77     6  21
53     NA      59  1.7   76     6  22
54     NA      91  4.6   76     6  23
55     NA     250  6.3   76     6  24
56     NA     135  8.0   75     6  25
57     NA     127  8.0   78     6  26
58     NA      47 10.3   73     6  27
59     NA      98 11.5   80     6  28
60     NA      31 14.9   77     6  29
61     NA     138  8.0   83     6  30
62    135     269  4.1   84     7   1
63     49     248  9.2   85     7   2
64     32     236  9.2   81     7   3
65     NA     101 10.9   84     7   4
66     64     175  4.6   83     7   5
67     40     314 10.9   83     7   6
68     77     276  5.1   88     7   7
69     97     267  6.3   92     7   8
70     97     272  5.7   92     7   9
71     85     175  7.4   89     7  10
72     NA     139  8.6   82     7  11
73     10     264 14.3   73     7  12
74     27     175 14.9   81     7  13
75     NA     291 14.9   91     7  14
76      7      48 14.3   80     7  15
77     48     260  6.9   81     7  16
78     35     274 10.3   82     7  17
79     61     285  6.3   84     7  18
80     79     187  5.1   87     7  19
81     63     220 11.5   85     7  20
82     16       7  6.9   74     7  21
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
85     80     294  8.6   86     7  24
86    108     223  8.0   85     7  25
87     20      81  8.6   82     7  26
88     52      82 12.0   86     7  27
89     82     213  7.4   88     7  28
90     50     275  7.4   86     7  29
91     64     253  7.4   83     7  30
92     59     254  9.2   81     7  31
93     39      83  6.9   81     8   1
94      9      24 13.8   81     8   2
95     16      77  7.4   82     8   3
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
98     66      NA  4.6   87     8   6
99    122     255  4.0   89     8   7
100    89     229 10.3   90     8   8
101   110     207  8.0   90     8   9
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
104    44     192 11.5   86     8  12
105    28     273 11.5   82     8  13
106    65     157  9.7   80     8  14
107    NA      64 11.5   79     8  15
108    22      71 10.3   77     8  16
109    59      51  6.3   79     8  17
110    23     115  7.4   76     8  18
111    31     244 10.9   78     8  19
112    44     190 10.3   78     8  20
113    21     259 15.5   77     8  21
114     9      36 14.3   72     8  22
115    NA     255 12.6   75     8  23
116    45     212  9.7   79     8  24
117   168     238  3.4   81     8  25
118    73     215  8.0   86     8  26
119    NA     153  5.7   88     8  27
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
128    47      95  7.4   87     9   5
129    32      92 15.5   84     9   6
130    20     252 10.9   80     9   7
131    23     220 10.3   78     9   8
132    21     230 10.9   75     9   9
133    24     259  9.7   73     9  10
134    44     236 14.9   81     9  11
135    21     259 15.5   76     9  12
136    28     238  6.3   77     9  13
137     9      24 10.9   71     9  14
138    13     112 11.5   71     9  15
139    46     237  6.9   78     9  16
140    18     224 13.8   67     9  17
141    13      27 10.3   76     9  18
142    24     238 10.3   68     9  19
143    16     201  8.0   82     9  20
144    13     238 12.6   64     9  21
145    23      14  9.2   71     9  22
146    36     139 10.3   81     9  23
147     7      49 10.3   69     9  24
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

To remove rows with NA values for a variable use drop_na()

airquality %>% drop_na(Ozone)
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      28      NA 14.9   66     5   6
6      23     299  8.6   65     5   7
7      19      99 13.8   59     5   8
8       8      19 20.1   61     5   9
9       7      NA  6.9   74     5  11
10     16     256  9.7   69     5  12
11     11     290  9.2   66     5  13
12     14     274 10.9   68     5  14
13     18      65 13.2   58     5  15
14     14     334 11.5   64     5  16
15     34     307 12.0   66     5  17
16      6      78 18.4   57     5  18
17     30     322 11.5   68     5  19
18     11      44  9.7   62     5  20
19      1       8  9.7   59     5  21
20     11     320 16.6   73     5  22
21      4      25  9.7   61     5  23
22     32      92 12.0   61     5  24
23     23      13 12.0   67     5  28
24     45     252 14.9   81     5  29
25    115     223  5.7   79     5  30
26     37     279  7.4   76     5  31
27     29     127  9.7   82     6   7
28     71     291 13.8   90     6   9
29     39     323 11.5   87     6  10
30     23     148  8.0   82     6  13
31     21     191 14.9   77     6  16
32     37     284 20.7   72     6  17
33     20      37  9.2   65     6  18
34     12     120 11.5   73     6  19
35     13     137 10.3   76     6  20
36    135     269  4.1   84     7   1
37     49     248  9.2   85     7   2
38     32     236  9.2   81     7   3
39     64     175  4.6   83     7   5
40     40     314 10.9   83     7   6
41     77     276  5.1   88     7   7
42     97     267  6.3   92     7   8
43     97     272  5.7   92     7   9
44     85     175  7.4   89     7  10
45     10     264 14.3   73     7  12
46     27     175 14.9   81     7  13
47      7      48 14.3   80     7  15
48     48     260  6.9   81     7  16
49     35     274 10.3   82     7  17
50     61     285  6.3   84     7  18
51     79     187  5.1   87     7  19
52     63     220 11.5   85     7  20
53     16       7  6.9   74     7  21
54     80     294  8.6   86     7  24
55    108     223  8.0   85     7  25
56     20      81  8.6   82     7  26
57     52      82 12.0   86     7  27
58     82     213  7.4   88     7  28
59     50     275  7.4   86     7  29
60     64     253  7.4   83     7  30
61     59     254  9.2   81     7  31
62     39      83  6.9   81     8   1
63      9      24 13.8   81     8   2
64     16      77  7.4   82     8   3
65     78      NA  6.9   86     8   4
66     35      NA  7.4   85     8   5
67     66      NA  4.6   87     8   6
68    122     255  4.0   89     8   7
69     89     229 10.3   90     8   8
70    110     207  8.0   90     8   9
71     44     192 11.5   86     8  12
72     28     273 11.5   82     8  13
73     65     157  9.7   80     8  14
74     22      71 10.3   77     8  16
75     59      51  6.3   79     8  17
76     23     115  7.4   76     8  18
77     31     244 10.9   78     8  19
78     44     190 10.3   78     8  20
79     21     259 15.5   77     8  21
80      9      36 14.3   72     8  22
81     45     212  9.7   79     8  24
82    168     238  3.4   81     8  25
83     73     215  8.0   86     8  26
84     76     203  9.7   97     8  28
85    118     225  2.3   94     8  29
86     84     237  6.3   96     8  30
87     85     188  6.3   94     8  31
88     96     167  6.9   91     9   1
89     78     197  5.1   92     9   2
90     73     183  2.8   93     9   3
91     91     189  4.6   93     9   4
92     47      95  7.4   87     9   5
93     32      92 15.5   84     9   6
94     20     252 10.9   80     9   7
95     23     220 10.3   78     9   8
96     21     230 10.9   75     9   9
97     24     259  9.7   73     9  10
98     44     236 14.9   81     9  11
99     21     259 15.5   76     9  12
100    28     238  6.3   77     9  13
101     9      24 10.9   71     9  14
102    13     112 11.5   71     9  15
103    46     237  6.9   78     9  16
104    18     224 13.8   67     9  17
105    13      27 10.3   76     9  18
106    24     238 10.3   68     9  19
107    16     201  8.0   82     9  20
108    13     238 12.6   64     9  21
109    23      14  9.2   71     9  22
110    36     139 10.3   81     9  23
111     7      49 10.3   69     9  24
112    14      20 16.6   63     9  25
113    30     193  6.9   70     9  26
114    14     191 14.3   75     9  28
115    18     131  8.0   76     9  29
116    20     223 11.5   68     9  30

Drop all NAs with drop_na()

Drops rows with any missing data in any column.

airquality %>% drop_na()
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      23     299  8.6   65     5   7
6      19      99 13.8   59     5   8
7       8      19 20.1   61     5   9
8      16     256  9.7   69     5  12
9      11     290  9.2   66     5  13
10     14     274 10.9   68     5  14
11     18      65 13.2   58     5  15
12     14     334 11.5   64     5  16
13     34     307 12.0   66     5  17
14      6      78 18.4   57     5  18
15     30     322 11.5   68     5  19
16     11      44  9.7   62     5  20
17      1       8  9.7   59     5  21
18     11     320 16.6   73     5  22
19      4      25  9.7   61     5  23
20     32      92 12.0   61     5  24
21     23      13 12.0   67     5  28
22     45     252 14.9   81     5  29
23    115     223  5.7   79     5  30
24     37     279  7.4   76     5  31
25     29     127  9.7   82     6   7
26     71     291 13.8   90     6   9
27     39     323 11.5   87     6  10
28     23     148  8.0   82     6  13
29     21     191 14.9   77     6  16
30     37     284 20.7   72     6  17
31     20      37  9.2   65     6  18
32     12     120 11.5   73     6  19
33     13     137 10.3   76     6  20
34    135     269  4.1   84     7   1
35     49     248  9.2   85     7   2
36     32     236  9.2   81     7   3
37     64     175  4.6   83     7   5
38     40     314 10.9   83     7   6
39     77     276  5.1   88     7   7
40     97     267  6.3   92     7   8
41     97     272  5.7   92     7   9
42     85     175  7.4   89     7  10
43     10     264 14.3   73     7  12
44     27     175 14.9   81     7  13
45      7      48 14.3   80     7  15
46     48     260  6.9   81     7  16
47     35     274 10.3   82     7  17
48     61     285  6.3   84     7  18
49     79     187  5.1   87     7  19
50     63     220 11.5   85     7  20
51     16       7  6.9   74     7  21
52     80     294  8.6   86     7  24
53    108     223  8.0   85     7  25
54     20      81  8.6   82     7  26
55     52      82 12.0   86     7  27
56     82     213  7.4   88     7  28
57     50     275  7.4   86     7  29
58     64     253  7.4   83     7  30
59     59     254  9.2   81     7  31
60     39      83  6.9   81     8   1
61      9      24 13.8   81     8   2
62     16      77  7.4   82     8   3
63    122     255  4.0   89     8   7
64     89     229 10.3   90     8   8
65    110     207  8.0   90     8   9
66     44     192 11.5   86     8  12
67     28     273 11.5   82     8  13
68     65     157  9.7   80     8  14
69     22      71 10.3   77     8  16
70     59      51  6.3   79     8  17
71     23     115  7.4   76     8  18
72     31     244 10.9   78     8  19
73     44     190 10.3   78     8  20
74     21     259 15.5   77     8  21
75      9      36 14.3   72     8  22
76     45     212  9.7   79     8  24
77    168     238  3.4   81     8  25
78     73     215  8.0   86     8  26
79     76     203  9.7   97     8  28
80    118     225  2.3   94     8  29
81     84     237  6.3   96     8  30
82     85     188  6.3   94     8  31
83     96     167  6.9   91     9   1
84     78     197  5.1   92     9   2
85     73     183  2.8   93     9   3
86     91     189  4.6   93     9   4
87     47      95  7.4   87     9   5
88     32      92 15.5   84     9   6
89     20     252 10.9   80     9   7
90     23     220 10.3   78     9   8
91     21     230 10.9   75     9   9
92     24     259  9.7   73     9  10
93     44     236 14.9   81     9  11
94     21     259 15.5   76     9  12
95     28     238  6.3   77     9  13
96      9      24 10.9   71     9  14
97     13     112 11.5   71     9  15
98     46     237  6.9   78     9  16
99     18     224 13.8   67     9  17
100    13      27 10.3   76     9  18
101    24     238 10.3   68     9  19
102    16     201  8.0   82     9  20
103    13     238 12.6   64     9  21
104    23      14  9.2   71     9  22
105    36     139 10.3   81     9  23
106     7      49 10.3   69     9  24
107    14      20 16.6   63     9  25
108    30     193  6.9   70     9  26
109    14     191 14.3   75     9  28
110    18     131  8.0   76     9  29
111    20     223 11.5   68     9  30

Drop columns with any missing values

Use the miss_var_which() function from naniar

miss_var_which(airquality) # which columns have missing values
[1] "Ozone"   "Solar.R"

Drop columns with any missing values

miss_var_which and function from naniar (need a data frame)

airquality %>% select(!miss_var_which(airquality))
    Wind Temp Month Day
1    7.4   67     5   1
2    8.0   72     5   2
3   12.6   74     5   3
4   11.5   62     5   4
5   14.3   56     5   5
6   14.9   66     5   6
7    8.6   65     5   7
8   13.8   59     5   8
9   20.1   61     5   9
10   8.6   69     5  10
11   6.9   74     5  11
12   9.7   69     5  12
13   9.2   66     5  13
14  10.9   68     5  14
15  13.2   58     5  15
16  11.5   64     5  16
17  12.0   66     5  17
18  18.4   57     5  18
19  11.5   68     5  19
20   9.7   62     5  20
21   9.7   59     5  21
22  16.6   73     5  22
23   9.7   61     5  23
24  12.0   61     5  24
25  16.6   57     5  25
26  14.9   58     5  26
27   8.0   57     5  27
28  12.0   67     5  28
29  14.9   81     5  29
30   5.7   79     5  30
31   7.4   76     5  31
32   8.6   78     6   1
33   9.7   74     6   2
34  16.1   67     6   3
35   9.2   84     6   4
36   8.6   85     6   5
37  14.3   79     6   6
38   9.7   82     6   7
39   6.9   87     6   8
40  13.8   90     6   9
41  11.5   87     6  10
42  10.9   93     6  11
43   9.2   92     6  12
44   8.0   82     6  13
45  13.8   80     6  14
46  11.5   79     6  15
47  14.9   77     6  16
48  20.7   72     6  17
49   9.2   65     6  18
50  11.5   73     6  19
51  10.3   76     6  20
52   6.3   77     6  21
53   1.7   76     6  22
54   4.6   76     6  23
55   6.3   76     6  24
56   8.0   75     6  25
57   8.0   78     6  26
58  10.3   73     6  27
59  11.5   80     6  28
60  14.9   77     6  29
61   8.0   83     6  30
62   4.1   84     7   1
63   9.2   85     7   2
64   9.2   81     7   3
65  10.9   84     7   4
66   4.6   83     7   5
67  10.9   83     7   6
68   5.1   88     7   7
69   6.3   92     7   8
70   5.7   92     7   9
71   7.4   89     7  10
72   8.6   82     7  11
73  14.3   73     7  12
74  14.9   81     7  13
75  14.9   91     7  14
76  14.3   80     7  15
77   6.9   81     7  16
78  10.3   82     7  17
79   6.3   84     7  18
80   5.1   87     7  19
81  11.5   85     7  20
82   6.9   74     7  21
83   9.7   81     7  22
84  11.5   82     7  23
85   8.6   86     7  24
86   8.0   85     7  25
87   8.6   82     7  26
88  12.0   86     7  27
89   7.4   88     7  28
90   7.4   86     7  29
91   7.4   83     7  30
92   9.2   81     7  31
93   6.9   81     8   1
94  13.8   81     8   2
95   7.4   82     8   3
96   6.9   86     8   4
97   7.4   85     8   5
98   4.6   87     8   6
99   4.0   89     8   7
100 10.3   90     8   8
101  8.0   90     8   9
102  8.6   92     8  10
103 11.5   86     8  11
104 11.5   86     8  12
105 11.5   82     8  13
106  9.7   80     8  14
107 11.5   79     8  15
108 10.3   77     8  16
109  6.3   79     8  17
110  7.4   76     8  18
111 10.9   78     8  19
112 10.3   78     8  20
113 15.5   77     8  21
114 14.3   72     8  22
115 12.6   75     8  23
116  9.7   79     8  24
117  3.4   81     8  25
118  8.0   86     8  26
119  5.7   88     8  27
120  9.7   97     8  28
121  2.3   94     8  29
122  6.3   96     8  30
123  6.3   94     8  31
124  6.9   91     9   1
125  5.1   92     9   2
126  2.8   93     9   3
127  4.6   93     9   4
128  7.4   87     9   5
129 15.5   84     9   6
130 10.9   80     9   7
131 10.3   78     9   8
132 10.9   75     9   9
133  9.7   73     9  10
134 14.9   81     9  11
135 15.5   76     9  12
136  6.3   77     9  13
137 10.9   71     9  14
138 11.5   71     9  15
139  6.9   78     9  16
140 13.8   67     9  17
141 10.3   76     9  18
142 10.3   68     9  19
143  8.0   82     9  20
144 12.6   64     9  21
145  9.2   71     9  22
146 10.3   81     9  23
147 10.3   69     9  24
148 16.6   63     9  25
149  6.9   70     9  26
150 13.2   77     9  27
151 14.3   75     9  28
152  8.0   76     9  29
153 11.5   68     9  30

Change a value to be NA

Let’s say we think that all 0 values should be NA.

library(readr)
bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv")
bike <-bike %>% select(type, dateInstalled)
bike
# A tibble: 1,631 × 2
   type           dateInstalled
   <chr>                  <dbl>
 1 BIKE BOULEVARD             0
 2 SIDEPATH                2010
 3 SIGNED ROUTE            2010
 4 SIDEPATH                   0
 5 BIKE LANE               2011
 6 SIGNED ROUTE            2007
 7 SIGNED ROUTE            2007
 8 SIGNED ROUTE            2007
 9 BIKE LANE               2009
10 SHARROW                 2007
# ℹ 1,621 more rows

Change a value to be NA

The na_if() function of dplyr can be helpful for changing all 0 values to NA.

More on mutate() soon!

bike %>% 
  mutate(dateInstalled = na_if(dateInstalled, 0))
# A tibble: 1,631 × 2
   type           dateInstalled
   <chr>                  <dbl>
 1 BIKE BOULEVARD            NA
 2 SIDEPATH                2010
 3 SIGNED ROUTE            2010
 4 SIDEPATH                  NA
 5 BIKE LANE               2011
 6 SIGNED ROUTE            2007
 7 SIGNED ROUTE            2007
 8 SIGNED ROUTE            2007
 9 BIKE LANE               2009
10 SHARROW                 2007
# ℹ 1,621 more rows

Change NA to be a value

The replace_na() function (part of the tidyr package), can do the opposite of na_if().

Replacement much match the class of the other values in the column (e.g., character, numeric).

bike %>% 
  mutate(dateInstalled = replace_na(dateInstalled, 2005))
# A tibble: 1,631 × 2
   type           dateInstalled
   <chr>                  <dbl>
 1 BIKE BOULEVARD             0
 2 SIDEPATH                2010
 3 SIGNED ROUTE            2010
 4 SIDEPATH                   0
 5 BIKE LANE               2011
 6 SIGNED ROUTE            2007
 7 SIGNED ROUTE            2007
 8 SIGNED ROUTE            2007
 9 BIKE LANE               2009
10 SHARROW                 2007
# ℹ 1,621 more rows

Think about NA

THINK ABOUT YOUR DATA FIRST!

⚠️ Sometimes removing NA values leads to distorted math - be careful!

⚠️ Think about what your NA means for your data (are you sure ?).

  • Is an NA for values so low they could not be reported?

  • Or is it if it was too low and also if there was a different issue (like no one reported)?

Think about NA

If it is something more like a zero then you might want it included in your data like a zero instead of an NA.

Example: - survey reports NA if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week

⚠️ You might want to keep the NA values so that you know the original sample size.

Word of caution

⚠️ Calculating percentages will give you a different result depending on your choice to include NA values.!

This is because the denominator changes.

Word of caution - Percentages with NA

count(bike, dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 9 × 3
  dateInstalled     n percent
          <dbl> <int>   <dbl>
1             0   126   7.73 
2          2006     2   0.123
3          2007   368  22.6  
4          2008   206  12.6  
5          2009    86   5.27 
6          2010   625  38.3  
7          2011   101   6.19 
8          2012   107   6.56 
9          2013    10   0.613

Word of caution - Percentages with NA

bike %>% drop_na(dateInstalled) %>% 
  count(dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 9 × 3
  dateInstalled     n percent
          <dbl> <int>   <dbl>
1             0   126   7.73 
2          2006     2   0.123
3          2007   368  22.6  
4          2008   206  12.6  
5          2009    86   5.27 
6          2010   625  38.3  
7          2011   101   6.19 
8          2012   107   6.56 
9          2013    10   0.613

Should you be dividing by the total count with NA values included?
It depends on your data and what NA might mean.
Pay attention to your data and your NA values!

Summary

  • is.na(),any(is.na()), count(), and functions from naniar like gg_miss_var() can help determine if we have NA values
  • filter() automatically removes NA values - can’t confirm or deny if condition is met (need | is.na() to keep them)
  • drop_na() can help you remove NA values from a variable or an entire data frame
  • NA values can change your calculation results
  • think about what NA values represent - don’t drop them if you shouldn’t

https://sisbid.github.io/Data-Wrangling/labs/missing-data-lab.Rmd