Dealing with Missing Data

Missing data types

One of the most important aspects of data cleaning is missing values.

Types of “missing” data:

  • NA - general missing data
  • NaN - stands for “Not a Number”, happens when you do 0/0.
  • Inf and -Inf - Infinity, happens when you take a positive number (or negative number) by 0.

Missing Data with Logicals

Logical operations return NA for NA values. Think about it, the data could be > 2 or not we don’t know, so R says there is no TRUE or FALSE, so that is missing:

x = c(0, NA, 2, 3, 4)
x > 2
[1] FALSE    NA FALSE  TRUE  TRUE

Missing Data with Operations

Similarly with logicals, operations/arithmetic with NA will result in NAs:

x + 2
[1]  2 NA  4  5  6
x * 2
[1]  0 NA  4  6  8

Missing Data Issues

Recall that mathematical operations with NA often result in NAs.

sum(c(1,2,3,NA))
[1] NA
mean(c(1,2,3,NA))
[1] NA
median(c(1,2,3,NA))
[1] NA

Missing Data Issues

Also true when we combine mathematical operations and logicals. Recall that TRUE is evaluated as 1 and FALSE is evaluated as 0.

x <- c(TRUE, TRUE, TRUE, TRUE, FALSE, NA)
sum(x)
[1] NA
sum(x, na.rm = TRUE)
[1] 4

Finding Missing data

  • is.na - looks for NAN and NA
  • is.nan- looks for NAN
  • is.infinite - looks for Inf or -Inf
test <- c(0,NA, -1)
test/0
[1]  NaN   NA -Inf
test <- test/0
is.na(test)
[1]  TRUE  TRUE FALSE
is.nan(test)
[1]  TRUE FALSE FALSE
is.infinite(test)
[1] FALSE FALSE  TRUE

Useful checking functions

Do we have any NAs? (any can help)

A <- c(1, 2, 3, NA)
B <- c(1, 2, 3, 4)
any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE

Useful checking functions

Are all the values NA? (all can help)

A <- c(1, 2, 3, NA)
B <- c(1, 2, 3, 4)
all(is.na(A)) # are there any NAs - YES/TRUE
[1] FALSE
all(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE

Finding NA values with count()

Check the values for your variables, are they what you expect?

count() is a great option because it gives you:

  1. The unique values
  2. The amount of these values

Check if rare values make sense.

library(readr)
bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv")
count(bike, subType)
# A tibble: 4 × 2
  subType     n
  <chr>   <int>
1 STCLN       1
2 STRALY      3
3 STRPRD   1623
4 <NA>        4

naniar

Sometimes you need to look at lots of data though… the naniar package is a good option.

#install.packages("naniar")
library(naniar)

Air quality data

The airquality dataset comes with R about air quality in New York in 1973.

?airquality # use this to find out more about the data

naniar: pct_complete()

This can tell you if there are missing values in the dataset.

pct_complete(airquality)
[1] 95.20697

Or for a particular variable:

airquality %>% select(Ozone) %>%
pct_complete()
[1] 75.81699

naniar:miss_var_summary()

To get the percent missing (and counts) for each variable as a table, use this function.

miss_var_summary(airquality)
# A tibble: 6 × 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 Ozone        37    24.2 
2 Solar.R       7     4.58
3 Wind          0     0   
4 Temp          0     0   
5 Month         0     0   
6 Day           0     0   

miss_case_summary which rows have missing data in order

miss_case_summary(airquality)
# A tibble: 153 × 3
    case n_miss pct_miss
   <int>  <int>    <dbl>
 1     5      2     33.3
 2    27      2     33.3
 3     6      1     16.7
 4    10      1     16.7
 5    11      1     16.7
 6    25      1     16.7
 7    26      1     16.7
 8    32      1     16.7
 9    33      1     16.7
10    34      1     16.7
# ℹ 143 more rows

naniar plots

The gg_miss_var() function creates a nice plot about the number of missing values for each variable, (need a data frame).

gg_miss_var(airquality)

filter() and missing data

Be careful with missing data using subsetting!

filter() removes missing values by default. Because R can’t tell for sure if an NA value meets the condition. To keep them need to add is.na() conditional.

Think about if this is OK or not - it depends on your data!

filter() and missing data

What if NA values represent values that are so low it is undetectable?

Filter will drop them from the data.

airquality %>% filter(Ozone < 5)
  Ozone Solar.R Wind Temp Month Day
1     1       8  9.7   59     5  21
2     4      25  9.7   61     5  23

filter() and missing data

is.na() can help us keep them.

airquality %>% filter(Ozone < 5 | is.na(Ozone))
   Ozone Solar.R Wind Temp Month Day
1     NA      NA 14.3   56     5   5
2     NA     194  8.6   69     5  10
3      1       8  9.7   59     5  21
4      4      25  9.7   61     5  23
5     NA      66 16.6   57     5  25
6     NA     266 14.9   58     5  26
7     NA      NA  8.0   57     5  27
8     NA     286  8.6   78     6   1
9     NA     287  9.7   74     6   2
10    NA     242 16.1   67     6   3
11    NA     186  9.2   84     6   4
12    NA     220  8.6   85     6   5
13    NA     264 14.3   79     6   6
14    NA     273  6.9   87     6   8
15    NA     259 10.9   93     6  11
16    NA     250  9.2   92     6  12
17    NA     332 13.8   80     6  14
18    NA     322 11.5   79     6  15
19    NA     150  6.3   77     6  21
20    NA      59  1.7   76     6  22
21    NA      91  4.6   76     6  23
22    NA     250  6.3   76     6  24
23    NA     135  8.0   75     6  25
24    NA     127  8.0   78     6  26
25    NA      47 10.3   73     6  27
26    NA      98 11.5   80     6  28
27    NA      31 14.9   77     6  29
28    NA     138  8.0   83     6  30
29    NA     101 10.9   84     7   4
30    NA     139  8.6   82     7  11
31    NA     291 14.9   91     7  14
32    NA     258  9.7   81     7  22
33    NA     295 11.5   82     7  23
34    NA     222  8.6   92     8  10
35    NA     137 11.5   86     8  11
36    NA      64 11.5   79     8  15
37    NA     255 12.6   75     8  23
38    NA     153  5.7   88     8  27
39    NA     145 13.2   77     9  27

To remove rows with NA values for a variable use drop_na()

A function from the tidyr package. (Need a data frame to start!)

Disclaimer: Don’t do this unless you have thought about if dropping NA values makes sense based on knowing what these values mean in your data.

dim(airquality)
[1] 153   6
airquality %>% drop_na(Ozone)
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      28      NA 14.9   66     5   6
6      23     299  8.6   65     5   7
7      19      99 13.8   59     5   8
8       8      19 20.1   61     5   9
9       7      NA  6.9   74     5  11
10     16     256  9.7   69     5  12
11     11     290  9.2   66     5  13
12     14     274 10.9   68     5  14
13     18      65 13.2   58     5  15
14     14     334 11.5   64     5  16
15     34     307 12.0   66     5  17
16      6      78 18.4   57     5  18
17     30     322 11.5   68     5  19
18     11      44  9.7   62     5  20
19      1       8  9.7   59     5  21
20     11     320 16.6   73     5  22
21      4      25  9.7   61     5  23
22     32      92 12.0   61     5  24
23     23      13 12.0   67     5  28
24     45     252 14.9   81     5  29
25    115     223  5.7   79     5  30
26     37     279  7.4   76     5  31
27     29     127  9.7   82     6   7
28     71     291 13.8   90     6   9
29     39     323 11.5   87     6  10
30     23     148  8.0   82     6  13
31     21     191 14.9   77     6  16
32     37     284 20.7   72     6  17
33     20      37  9.2   65     6  18
34     12     120 11.5   73     6  19
35     13     137 10.3   76     6  20
36    135     269  4.1   84     7   1
37     49     248  9.2   85     7   2
38     32     236  9.2   81     7   3
39     64     175  4.6   83     7   5
40     40     314 10.9   83     7   6
41     77     276  5.1   88     7   7
42     97     267  6.3   92     7   8
43     97     272  5.7   92     7   9
44     85     175  7.4   89     7  10
45     10     264 14.3   73     7  12
46     27     175 14.9   81     7  13
47      7      48 14.3   80     7  15
48     48     260  6.9   81     7  16
49     35     274 10.3   82     7  17
50     61     285  6.3   84     7  18
51     79     187  5.1   87     7  19
52     63     220 11.5   85     7  20
53     16       7  6.9   74     7  21
54     80     294  8.6   86     7  24
55    108     223  8.0   85     7  25
56     20      81  8.6   82     7  26
57     52      82 12.0   86     7  27
58     82     213  7.4   88     7  28
59     50     275  7.4   86     7  29
60     64     253  7.4   83     7  30
61     59     254  9.2   81     7  31
62     39      83  6.9   81     8   1
63      9      24 13.8   81     8   2
64     16      77  7.4   82     8   3
65     78      NA  6.9   86     8   4
66     35      NA  7.4   85     8   5
67     66      NA  4.6   87     8   6
68    122     255  4.0   89     8   7
69     89     229 10.3   90     8   8
70    110     207  8.0   90     8   9
71     44     192 11.5   86     8  12
72     28     273 11.5   82     8  13
73     65     157  9.7   80     8  14
74     22      71 10.3   77     8  16
75     59      51  6.3   79     8  17
76     23     115  7.4   76     8  18
77     31     244 10.9   78     8  19
78     44     190 10.3   78     8  20
79     21     259 15.5   77     8  21
80      9      36 14.3   72     8  22
81     45     212  9.7   79     8  24
82    168     238  3.4   81     8  25
83     73     215  8.0   86     8  26
84     76     203  9.7   97     8  28
85    118     225  2.3   94     8  29
86     84     237  6.3   96     8  30
87     85     188  6.3   94     8  31
88     96     167  6.9   91     9   1
89     78     197  5.1   92     9   2
90     73     183  2.8   93     9   3
91     91     189  4.6   93     9   4
92     47      95  7.4   87     9   5
93     32      92 15.5   84     9   6
94     20     252 10.9   80     9   7
95     23     220 10.3   78     9   8
96     21     230 10.9   75     9   9
97     24     259  9.7   73     9  10
98     44     236 14.9   81     9  11
99     21     259 15.5   76     9  12
100    28     238  6.3   77     9  13
101     9      24 10.9   71     9  14
102    13     112 11.5   71     9  15
103    46     237  6.9   78     9  16
104    18     224 13.8   67     9  17
105    13      27 10.3   76     9  18
106    24     238 10.3   68     9  19
107    16     201  8.0   82     9  20
108    13     238 12.6   64     9  21
109    23      14  9.2   71     9  22
110    36     139 10.3   81     9  23
111     7      49 10.3   69     9  24
112    14      20 16.6   63     9  25
113    30     193  6.9   70     9  26
114    14     191 14.3   75     9  28
115    18     131  8.0   76     9  29
116    20     223 11.5   68     9  30

To remove rows with NA values for a data frame use drop_na()

This function of the tidyr package drops rows with any missing data in any column when used on a df.

airquality %>% drop_na()
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      23     299  8.6   65     5   7
6      19      99 13.8   59     5   8
7       8      19 20.1   61     5   9
8      16     256  9.7   69     5  12
9      11     290  9.2   66     5  13
10     14     274 10.9   68     5  14
11     18      65 13.2   58     5  15
12     14     334 11.5   64     5  16
13     34     307 12.0   66     5  17
14      6      78 18.4   57     5  18
15     30     322 11.5   68     5  19
16     11      44  9.7   62     5  20
17      1       8  9.7   59     5  21
18     11     320 16.6   73     5  22
19      4      25  9.7   61     5  23
20     32      92 12.0   61     5  24
21     23      13 12.0   67     5  28
22     45     252 14.9   81     5  29
23    115     223  5.7   79     5  30
24     37     279  7.4   76     5  31
25     29     127  9.7   82     6   7
26     71     291 13.8   90     6   9
27     39     323 11.5   87     6  10
28     23     148  8.0   82     6  13
29     21     191 14.9   77     6  16
30     37     284 20.7   72     6  17
31     20      37  9.2   65     6  18
32     12     120 11.5   73     6  19
33     13     137 10.3   76     6  20
34    135     269  4.1   84     7   1
35     49     248  9.2   85     7   2
36     32     236  9.2   81     7   3
37     64     175  4.6   83     7   5
38     40     314 10.9   83     7   6
39     77     276  5.1   88     7   7
40     97     267  6.3   92     7   8
41     97     272  5.7   92     7   9
42     85     175  7.4   89     7  10
43     10     264 14.3   73     7  12
44     27     175 14.9   81     7  13
45      7      48 14.3   80     7  15
46     48     260  6.9   81     7  16
47     35     274 10.3   82     7  17
48     61     285  6.3   84     7  18
49     79     187  5.1   87     7  19
50     63     220 11.5   85     7  20
51     16       7  6.9   74     7  21
52     80     294  8.6   86     7  24
53    108     223  8.0   85     7  25
54     20      81  8.6   82     7  26
55     52      82 12.0   86     7  27
56     82     213  7.4   88     7  28
57     50     275  7.4   86     7  29
58     64     253  7.4   83     7  30
59     59     254  9.2   81     7  31
60     39      83  6.9   81     8   1
61      9      24 13.8   81     8   2
62     16      77  7.4   82     8   3
63    122     255  4.0   89     8   7
64     89     229 10.3   90     8   8
65    110     207  8.0   90     8   9
66     44     192 11.5   86     8  12
67     28     273 11.5   82     8  13
68     65     157  9.7   80     8  14
69     22      71 10.3   77     8  16
70     59      51  6.3   79     8  17
71     23     115  7.4   76     8  18
72     31     244 10.9   78     8  19
73     44     190 10.3   78     8  20
74     21     259 15.5   77     8  21
75      9      36 14.3   72     8  22
76     45     212  9.7   79     8  24
77    168     238  3.4   81     8  25
78     73     215  8.0   86     8  26
79     76     203  9.7   97     8  28
80    118     225  2.3   94     8  29
81     84     237  6.3   96     8  30
82     85     188  6.3   94     8  31
83     96     167  6.9   91     9   1
84     78     197  5.1   92     9   2
85     73     183  2.8   93     9   3
86     91     189  4.6   93     9   4
87     47      95  7.4   87     9   5
88     32      92 15.5   84     9   6
89     20     252 10.9   80     9   7
90     23     220 10.3   78     9   8
91     21     230 10.9   75     9   9
92     24     259  9.7   73     9  10
93     44     236 14.9   81     9  11
94     21     259 15.5   76     9  12
95     28     238  6.3   77     9  13
96      9      24 10.9   71     9  14
97     13     112 11.5   71     9  15
98     46     237  6.9   78     9  16
99     18     224 13.8   67     9  17
100    13      27 10.3   76     9  18
101    24     238 10.3   68     9  19
102    16     201  8.0   82     9  20
103    13     238 12.6   64     9  21
104    23      14  9.2   71     9  22
105    36     139 10.3   81     9  23
106     7      49 10.3   69     9  24
107    14      20 16.6   63     9  25
108    30     193  6.9   70     9  26
109    14     191 14.3   75     9  28
110    18     131  8.0   76     9  29
111    20     223 11.5   68     9  30

Drop columns with any missing values

Use the miss_var_which() function from naniar

miss_var_which(airquality) # which columns have missing values
[1] "Ozone"   "Solar.R"

Drop columns with any missing values

miss_var_which and function from naniar (need a data frame)

airquality %>% select(!miss_var_which(airquality))
    Wind Temp Month Day
1    7.4   67     5   1
2    8.0   72     5   2
3   12.6   74     5   3
4   11.5   62     5   4
5   14.3   56     5   5
6   14.9   66     5   6
7    8.6   65     5   7
8   13.8   59     5   8
9   20.1   61     5   9
10   8.6   69     5  10
11   6.9   74     5  11
12   9.7   69     5  12
13   9.2   66     5  13
14  10.9   68     5  14
15  13.2   58     5  15
16  11.5   64     5  16
17  12.0   66     5  17
18  18.4   57     5  18
19  11.5   68     5  19
20   9.7   62     5  20
21   9.7   59     5  21
22  16.6   73     5  22
23   9.7   61     5  23
24  12.0   61     5  24
25  16.6   57     5  25
26  14.9   58     5  26
27   8.0   57     5  27
28  12.0   67     5  28
29  14.9   81     5  29
30   5.7   79     5  30
31   7.4   76     5  31
32   8.6   78     6   1
33   9.7   74     6   2
34  16.1   67     6   3
35   9.2   84     6   4
36   8.6   85     6   5
37  14.3   79     6   6
38   9.7   82     6   7
39   6.9   87     6   8
40  13.8   90     6   9
41  11.5   87     6  10
42  10.9   93     6  11
43   9.2   92     6  12
44   8.0   82     6  13
45  13.8   80     6  14
46  11.5   79     6  15
47  14.9   77     6  16
48  20.7   72     6  17
49   9.2   65     6  18
50  11.5   73     6  19
51  10.3   76     6  20
52   6.3   77     6  21
53   1.7   76     6  22
54   4.6   76     6  23
55   6.3   76     6  24
56   8.0   75     6  25
57   8.0   78     6  26
58  10.3   73     6  27
59  11.5   80     6  28
60  14.9   77     6  29
61   8.0   83     6  30
62   4.1   84     7   1
63   9.2   85     7   2
64   9.2   81     7   3
65  10.9   84     7   4
66   4.6   83     7   5
67  10.9   83     7   6
68   5.1   88     7   7
69   6.3   92     7   8
70   5.7   92     7   9
71   7.4   89     7  10
72   8.6   82     7  11
73  14.3   73     7  12
74  14.9   81     7  13
75  14.9   91     7  14
76  14.3   80     7  15
77   6.9   81     7  16
78  10.3   82     7  17
79   6.3   84     7  18
80   5.1   87     7  19
81  11.5   85     7  20
82   6.9   74     7  21
83   9.7   81     7  22
84  11.5   82     7  23
85   8.6   86     7  24
86   8.0   85     7  25
87   8.6   82     7  26
88  12.0   86     7  27
89   7.4   88     7  28
90   7.4   86     7  29
91   7.4   83     7  30
92   9.2   81     7  31
93   6.9   81     8   1
94  13.8   81     8   2
95   7.4   82     8   3
96   6.9   86     8   4
97   7.4   85     8   5
98   4.6   87     8   6
99   4.0   89     8   7
100 10.3   90     8   8
101  8.0   90     8   9
102  8.6   92     8  10
103 11.5   86     8  11
104 11.5   86     8  12
105 11.5   82     8  13
106  9.7   80     8  14
107 11.5   79     8  15
108 10.3   77     8  16
109  6.3   79     8  17
110  7.4   76     8  18
111 10.9   78     8  19
112 10.3   78     8  20
113 15.5   77     8  21
114 14.3   72     8  22
115 12.6   75     8  23
116  9.7   79     8  24
117  3.4   81     8  25
118  8.0   86     8  26
119  5.7   88     8  27
120  9.7   97     8  28
121  2.3   94     8  29
122  6.3   96     8  30
123  6.3   94     8  31
124  6.9   91     9   1
125  5.1   92     9   2
126  2.8   93     9   3
127  4.6   93     9   4
128  7.4   87     9   5
129 15.5   84     9   6
130 10.9   80     9   7
131 10.3   78     9   8
132 10.9   75     9   9
133  9.7   73     9  10
134 14.9   81     9  11
135 15.5   76     9  12
136  6.3   77     9  13
137 10.9   71     9  14
138 11.5   71     9  15
139  6.9   78     9  16
140 13.8   67     9  17
141 10.3   76     9  18
142 10.3   68     9  19
143  8.0   82     9  20
144 12.6   64     9  21
145  9.2   71     9  22
146 10.3   81     9  23
147 10.3   69     9  24
148 16.6   63     9  25
149  6.9   70     9  26
150 13.2   77     9  27
151 14.3   75     9  28
152  8.0   76     9  29
153 11.5   68     9  30

Change a value to be NA

Let’s say we think that all 0 values should be NA.

library(readr)
bike <-read_csv("https://sisbid.github.io/Data-Wrangling/labs/Bike_Lanes.csv")
count(bike, dateInstalled)
# A tibble: 9 × 2
  dateInstalled     n
          <dbl> <int>
1             0   126
2          2006     2
3          2007   368
4          2008   206
5          2009    86
6          2010   625
7          2011   101
8          2012   107
9          2013    10

Change a value to be NA

The na_if() function of dplyr can be helpful for changing all 0 values to NA.

bike <- bike %>% 
  mutate(dateInstalled = na_if(dateInstalled, 0))
count(bike, dateInstalled)
# A tibble: 9 × 2
  dateInstalled     n
          <dbl> <int>
1          2006     2
2          2007   368
3          2008   206
4          2009    86
5          2010   625
6          2011   101
7          2012   107
8          2013    10
9            NA   126

Change NA to be a value

The replace_na() function (part of the tidyr package), can do the opposite of na_if(). (note that you must use numeric values as replacement - we will show how to replace with character strings soon)

bike %>% 
  mutate(dateInstalled = replace_na(dateInstalled, 2005)) %>% 
  count(dateInstalled)
# A tibble: 9 × 2
  dateInstalled     n
          <dbl> <int>
1          2005   126
2          2006     2
3          2007   368
4          2008   206
5          2009    86
6          2010   625
7          2011   101
8          2012   107
9          2013    10

Think about NA

THINK ABOUT YOUR DATA FIRST!

⚠️ Sometimes removing NA values leads to distorted math - be careful!

⚠️ Think about what your NA means for your data (are you sure ?).

  • Is an NA for values so low they could not be reported?

  • Or is it if it was too low and also if there was a different issue (like no one reported)?

Think about NA

If it is something more like a zero then you might want it included in your data like a zero instead of an NA.

Example: - survey reports NA if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week

⚠️ You might want to keep the NA values so that you know the original sample size.

Word of caution

⚠️ Calculating percentages will give you a different result depending on your choice to include NA values.!

This is because the denominator changes.

Word of caution - Percentages with NA

count(bike, dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 9 × 3
  dateInstalled     n percent
          <dbl> <int>   <dbl>
1          2006     2   0.123
2          2007   368  22.6  
3          2008   206  12.6  
4          2009    86   5.27 
5          2010   625  38.3  
6          2011   101   6.19 
7          2012   107   6.56 
8          2013    10   0.613
9            NA   126   7.73 

Word of caution - Percentages with NA

bike %>% drop_na(dateInstalled) %>% 
  count(dateInstalled) %>% mutate(percent = (n/(sum(n)) *100))
# A tibble: 8 × 3
  dateInstalled     n percent
          <dbl> <int>   <dbl>
1          2006     2   0.133
2          2007   368  24.5  
3          2008   206  13.7  
4          2009    86   5.71 
5          2010   625  41.5  
6          2011   101   6.71 
7          2012   107   7.11 
8          2013    10   0.664

Should you be dividing by the total count with NA values included?
It depends on your data and what NA might mean.
Pay attention to your data and your NA values!

Summary

  • is.na(),any(is.na()), count(), and functions from naniar like gg_miss_var() can help determine if we have NA values
  • filter() automatically removes NA values - can’t confirm or deny if condition is met (need | is.na() to keep them)
  • drop_na() can help you remove NA values from a variable or an entire data frame
  • NA values can change your calculation results
  • think about what NA values represent - don’t drop them if you shouldn’t