1.17 Missing data

1.17.1 Identifying missing data

In Excel, missing data typically shows up as a blank cell. In SPSS it is represented as a period (.). R displays missing data as NA.

Missing Data in SPSS: https://stats.idre.ucla.edu/spss/modules/missing-data/

Why would data be missing? Besides the obvious causes (data entry errors, tech glitches, or just non-cooperative plants or people), sometimes values are out of range and you would rather set them to missing than change their value (a data edit).

Let’s look at the religion variable in the depression data set.

table(depress$relig, useNA="always")
## 
##    1    2    3    4    6 <NA> 
##  155   51   30   56    2    0

Looking at the codebook, there is no category 6 for religion. Let’s change all values of 6 to NA.

depress$relig[depress$relig==6] <- NA

This code says take all rows where relig is equal to 6, and change them to NA.

Confirm recode.

table(depress$relig, useNA="always")
## 
##    1    2    3    4 <NA> 
##  155   51   30   56    2

Notice the use of the useNA="always" argument. If we just looked at the base table without this argument, we would have never known there was missing data!

table(depress$relig)
## 
##   1   2   3   4 
## 155  51  30  56
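Another quick check is to count the missing values directly with is.na(). The sketch below uses a small made-up vector (the name relig.toy and its values are hypothetical, not from the depression data) to show the idea.

```r
# Hypothetical toy vector standing in for a categorical variable
relig.toy <- c(1, 2, NA, 4, 1, NA, 3)

# is.na() returns TRUE wherever a value is missing
is.na(relig.toy)

# Summing the logical vector counts the missing values
n.missing <- sum(is.na(relig.toy))
n.missing

# useNA = "ifany" shows the <NA> column only when missing values exist
table(relig.toy, useNA = "ifany")
```

Counting with sum(is.na()) works the same way for categorical and continuous variables, so it is a handy habit before any analysis.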

What about continuous variables? Well there happens to be no other missing data in this data set, so let’s make up a set of 7 data points stored in a variable named y.

y <- c(1, 2, 3, NA, 4, NA, 6)
y
## [1]  1  2  3 NA  4 NA  6

The #1 way to identify missing data in a continuous variable is by looking at the summary() values.

mean(y)
## [1] NA
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0     2.0     3.0     3.2     4.0     6.0       2
mean(y, na.rm=TRUE)
## [1] 3.2

In R, any arithmetic operation (like addition or multiplication) on missing data results in a missing value. The na.rm=TRUE argument tells R to drop the missing values first and calculate the complete case mean. This can be a biased estimate of the mean, depending on why the values are missing, but missing data is a topic worthy of its own course and is introduced in Chapter 18.
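To see how a missing value carries through calculations, here is a small sketch using the same y vector as above. The object names total and y.complete are introduced here only for illustration.

```r
y <- c(1, 2, 3, NA, 4, NA, 6)

# Arithmetic on NA gives NA in the same positions
y + 1

# Summaries return NA unless the missing values are removed first
sum(y)
total <- sum(y, na.rm = TRUE)
total

# Comparisons with NA are also NA, not FALSE
NA == 6

# Dropping the NAs by hand gives the same complete case mean as na.rm = TRUE
y.complete <- y[!is.na(y)]
mean(y.complete)
```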

1.17.2 Model predictions

Situation: You want to add model predictions to the data set, but you have missing data that was automatically dropped prior to analysis.

1.17.3 Regression

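R objects created by functions such as lm and glm store the data used to fit the model (the model frame, with incomplete rows already dropped) in the model object itself, in model$model. (glm objects also keep the original data set in model$data, but that copy still includes the rows that were dropped.) See Chapter 12 for an example.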
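One possible way to get predictions that line up with the original rows is to fit the model with na.action = na.exclude, which drops incomplete rows for fitting but pads fitted values with NA in their place. This is a sketch on made-up data (the data frame dat and its values are hypothetical), not necessarily the approach used in Chapter 12.

```r
# Hypothetical toy data with a missing predictor and a missing outcome
dat <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, NA, 10.1, 11.8)
)

# na.exclude drops incomplete rows for fitting but remembers where they were
fit <- lm(y ~ x, data = dat, na.action = na.exclude)

# The model frame holds only the rows actually used in the fit
nrow(fit$model)

# fitted() returns one value per original row, with NA for the dropped rows
dat$pred <- fitted(fit)
dat
```

With the default na.action (na.omit), fitted(fit) would only have as many values as there are complete rows, so assigning it to a new column would fail.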

1.17.4 Factor Analysis and Principal Components

If your original data had missing values, here are two ways to get the PCs / factor scores for the available data back onto the data set.

Method 1) Create an ID column and merge new variables onto original data. (add columns)

1. If no ID column exists, create one on the original dataset id = 1:NROW(data)
2. Use select() to extract the ID and all variables used in the factor analysis, then do a na.omit() to drop rows with any missing data. Save this as a new complete case data set.
3. Conduct PCA / Factor analysis on this new complete case data set (MINUS THE ID). Extract the PCs or factor scores.
4. Use bind_cols() to add the ID variable to the data containing factor scores.
5. Then left_join(original_data, factor_score_data) the factor scores back to the original data, using the ID variable as the joining key.
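The steps of Method 1 might look like the sketch below. It assumes the dplyr package is installed and uses a small made-up data set; the names dat, q1 to q3, cc, scores, and dat.with.pcs are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: three numeric items with some missing values
dat <- data.frame(
  q1 = c(1, 2, 3, NA, 5, 6, 7, 8),
  q2 = c(2, 1, 4, 3, NA, 7, 6, 9),
  q3 = c(1, 3, 2, 5, 4, 6, 8, 7)
)

# Step 1: create an ID column on the original data
dat$id <- 1:NROW(dat)

# Step 2: keep the ID and analysis variables, then drop incomplete rows
cc <- dat %>% select(id, q1, q2, q3) %>% na.omit()

# Step 3: run the PCA without the ID column and extract the scores
pc <- princomp(cc %>% select(-id))$scores

# Step 4: attach the ID to the scores
scores <- bind_cols(cc %>% select(id), as.data.frame(pc))

# Step 5: join the scores back onto the original data by ID
dat.with.pcs <- left_join(dat, scores, by = "id")
dat.with.pcs
```

Rows 4 and 5 were incomplete, so their component scores come back as NA, while the original row order is preserved.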

Method 2) Split the data, analyze one part then concatenate back together. (add rows)

1. Use the complete.cases() function to create a boolean vector indicating whether each row is complete.
2. Split the data into complete and incomplete rows.
3. Do the analysis on the complete rows, extracting the PCs / factor scores.
4. Add the PC / factor score data onto the complete rows using bind_cols().
5. Then use bind_rows() to put the two parts back together.

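library(dplyr)

# TRUE for rows with no missing values among the pb variables
cc.idx <- hiv %>% select(starts_with("pb")) %>% complete.cases() # 1

complete.rows <- hiv[cc.idx,] #2
incomplete.rows <- hiv[!cc.idx,]

# PCA on the pb variables of the complete rows only
pb <- complete.rows %>% select(starts_with("pb"))
pc.scores <- as.data.frame(princomp(pb)$scores) #3

complete.add.pc <- bind_cols(complete.rows, pc.scores) #4

# Incomplete rows get NA for the PC columns and end up at the bottom
hiv.with.pcs <- bind_rows(complete.add.pc, incomplete.rows) #5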