18.1 Identifying missing data

  • Missing data in R is denoted as NA
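  • Arithmetic on missing data returns missing: any calculation that involves NA yields NA unless the missing values are explicitly removed (for example with na.rm = TRUE)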
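
As a minimal sketch of this behaviour, the made-up vector x below (not part of the course data) shows how a single NA propagates through arithmetic and how na.rm = TRUE changes the result.

```r
x <- c(2, 4, NA, 8)            # hypothetical vector with one missing value

x_plus_one <- x + 1            # the third element is still NA
mean_with_na <- mean(x)        # NA, because one element is missing
mean_without_na <- mean(x, na.rm = TRUE)  # NA dropped before averaging

x_plus_one
mean_with_na
mean_without_na
```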

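The summary() function reports the number of NA's for each numeric or factor variable that has any. It does not report them for character variables.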
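
A short sketch of what this looks like, assuming the data are the survey data set from the MASS package (an assumption based on the variable names that appear later in this section):

```r
library(MASS)  # assumed source of the survey data

pulse_summary <- summary(survey$Pulse)  # numeric: an "NA's" entry is appended
sex_summary   <- summary(survey$Sex)    # factor: NA's are counted like an extra level

pulse_summary
sex_summary
```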

The is.na() function returns TRUE for each missing value, which makes it helpful for identifying rows with missing data.
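
One way this might be used is sketched below, again assuming the data are MASS::survey. The object names (pulse_missing, incomplete) are illustrative only.

```r
library(MASS)  # assumed source of the survey data

pulse_missing   <- is.na(survey$Pulse)  # TRUE where Pulse is NA
n_pulse_missing <- sum(pulse_missing)   # each TRUE counts as 1

head(survey[pulse_missing, ])           # first few rows that are missing Pulse

# Rows with at least one NA in any column
incomplete   <- !complete.cases(survey)
n_incomplete <- sum(incomplete)
```

complete.cases() is the row-wise counterpart: it flags the records with no missing values at all.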

The table() function does not show NA by default. Set useNA = "ifany" or useNA = "always" to include it.
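
The difference can be seen in a small sketch, assuming the data are MASS::survey and using the Smoke variable as an example:

```r
library(MASS)  # assumed source of the survey data

smoke_default <- table(survey$Smoke)                    # NA category is dropped
smoke_with_na <- table(survey$Smoke, useNA = "always")  # NA category is always shown

smoke_default
smoke_with_na
```

With useNA = "ifany" the NA category appears only when at least one value is actually missing.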

  • What percent of the data set is missing?

About 4% of the data points are missing.
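
A possible way to compute this figure, assuming the data are MASS::survey: is.na() on a data frame returns a logical matrix, so its mean is the proportion of missing cells.

```r
library(MASS)  # assumed source of the survey data

# Proportion of all cells (rows x columns) that are NA, as a percentage
pct_missing_overall <- 100 * mean(is.na(survey))
round(pct_missing_overall, 1)
```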

  • How much missing is there per variable?

The amount of missing data per variable varies from 0 to 19%.
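
The per-variable percentages could be obtained along these lines, assuming the data are MASS::survey. The name pct_missing is illustrative.

```r
library(MASS)  # assumed source of the survey data

# Column means of the logical NA matrix give the share missing per variable
pct_missing <- round(100 * colMeans(is.na(survey)), 1)
sort(pct_missing, decreasing = TRUE)
```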

18.1.1 Visualize missing patterns

Using ggplot2
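
The original plotting code is not reproduced here. A minimal sketch of a bar chart of the proportion missing per variable, with the y-axis running from 0 to 100%, might look like this. The data set (MASS::survey) and the object names pmpv and p are assumptions.

```r
library(MASS)     # assumed source of the survey data
library(ggplot2)

# One row per variable with its proportion of missing values
pmpv <- data.frame(
  variable = names(survey),
  pct_miss = colMeans(is.na(survey))
)

p <- ggplot(pmpv, aes(x = variable, y = pct_miss)) +
  geom_col() +
  geom_text(aes(label = paste0(round(100 * pct_miss, 1), "%")),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(x = NULL, y = "Percent missing")
p
```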

Using mice
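
The pattern table that follows is the kind of output produced by md.pattern() from the mice package. A sketch of the call, assuming the data are MASS::survey:

```r
library(MASS)  # assumed source of the survey data
library(mice)

# plot = FALSE returns only the pattern matrix, without drawing the pattern plot
pattern <- md.pattern(survey, plot = FALSE)
pattern
```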

##     Fold Exer Age Sex Wr.Hnd NW.Hnd W.Hnd Clap Smoke Height M.I Pulse    
## 168    1    1   1   1      1      1     1    1     1      1   1     1   0
## 38     1    1   1   1      1      1     1    1     1      1   1     0   1
## 20     1    1   1   1      1      1     1    1     1      0   0     1   2
## 7      1    1   1   1      1      1     1    1     1      0   0     0   3
## 1      1    1   1   1      1      1     1    1     0      0   0     1   3
## 1      1    1   1   1      1      1     0    1     1      1   1     1   1
## 1      1    1   1   1      0      0     1    0     1      1   1     1   3
## 1      1    1   1   0      1      1     1    1     1      1   1     1   1
##        0    0   0   1      1      1     1    1     1     28  28    45 107

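This somewhat ugly output tells us that 168 records have no missing data, 38 records are missing only Pulse, and 20 are missing both Height and M.I. Each row is one missing data pattern (1 = observed, 0 = missing): the number on the left is how many records follow that pattern, the number on the right is how many variables are missing in it, and the bottom row counts the missing values per variable.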

Using VIM
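
The two-panel display discussed here resembles the output of aggr() from VIM. A minimal sketch, assuming the data are MASS::survey; the argument choices are illustrative, not necessarily those used for the original figure.

```r
library(MASS)  # assumed source of the survey data
library(VIM)

# Left panel: proportion missing per variable (sorted).
# Right panel: missing data patterns, with their proportions shown as numbers.
miss_summary <- aggr(survey, numbers = TRUE, sortVars = TRUE,
                     prop = TRUE, cex.axis = 0.7)
```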

The plot on the left is a simplified and ordered version of the ggplot from above. The bars appear inflated because the y-axis only goes up to 15% instead of 100%.

The plot on the right shows the missing data patterns. It indicates that 71% of the records are complete cases and that everyone who is missing M.I is also missing Height.

Another plot that can be helpful to identify patterns of missing data is a marginplot (also from VIM).

  • Two continuous variables are plotted against each other.
  • Blue bivariate scatterplot and univariate boxplots are for the observations where values on both variables are observed.
  • Red univariate dotplots and boxplots are drawn for the data that is only observed on one of the two variables in question.
  • The darkred text indicates how many records are missing on both.
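
A sketch of such a plot for Pulse and Height, assuming the data are MASS::survey. The count of records missing on both variables is also computed directly so it can be compared with the number printed in the plot.

```r
library(MASS)  # assumed source of the survey data
library(VIM)

# marginplot() expects exactly two columns: the first on the x-axis, the second on the y-axis
marginplot(survey[, c("Pulse", "Height")])

# Records missing both Pulse and Height
both_missing <- sum(is.na(survey$Pulse) & is.na(survey$Height))
both_missing
```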

This shows us that the observations missing Pulse have the same median height as those with Pulse observed, but the observations missing Height have a higher median pulse rate than those with Height observed.
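
The medians read off the boxplots can also be checked numerically. The following is a sketch under the assumption that the data are MASS::survey; in each result the group labelled TRUE is the one missing the other variable.

```r
library(MASS)  # assumed source of the survey data

# Median Height, split by whether Pulse is missing
height_by_pulse <- tapply(survey$Height, is.na(survey$Pulse),
                          median, na.rm = TRUE)

# Median Pulse, split by whether Height is missing
pulse_by_height <- tapply(survey$Pulse, is.na(survey$Height),
                          median, na.rm = TRUE)

height_by_pulse
pulse_by_height
```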