1.6 Dealing with missing data post-analysis

Situation: You want to add model predictions to the data set, but you have missing data that was automatically dropped prior to analysis.

1.6.1 Regression

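R objects created by functions such as lm and glm store the data actually used to fit the model (the model frame, with incomplete rows already dropped) in the model object itself, in model$model. A glm object also has a model$data element, but that holds the data argument as it was supplied, not only the rows used. See Chapter 12 for an example.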
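
As a minimal sketch of the idea (this is not the Chapter 12 example), the code below uses R's built-in airquality data, which has missing values in Ozone and Solar.R. The data set, the variables, and the object names are assumptions chosen only for illustration. Setting na.action = na.exclude asks lm to drop incomplete rows when fitting but pad the predictions with NA, so they line up with the original rows.

```r
# Illustrative data: built-in airquality (an assumption, not from this chapter)
aq <- airquality

# na.exclude drops incomplete rows for fitting but remembers where they were
fit <- lm(Ozone ~ Solar.R + Wind, data = aq, na.action = na.exclude)

# The rows actually used in the fit are stored in the model frame
n.used <- nrow(fit$model)

# With na.exclude, predict() returns one value per original row,
# with NA for the rows that were dropped
aq$pred <- predict(fit)

# The number of rows without a prediction equals the number of dropped rows
n.missing.pred <- sum(is.na(aq$pred))
```

With the default na.action (na.omit), predict(fit) would instead return one value per row used in the fit, and would not line up with the original data without a key.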

1.6.2 Factor Analysis and Principal Components

If your original data had missing values, here are two ways to get the PCs / factor scores for the available data back onto the data set.

Method 1) Create an ID column and merge new variables onto original data. (add columns)

  1. If no ID column exists, create one on the original data set, e.g. id = 1:NROW(data).
  2. Use select() to extract the ID and all variables used in the factor analysis, then apply na.omit() to drop rows with any missing data. Save this as a new complete case data set.
  3. Conduct PCA / Factor analysis on this new complete case data set (MINUS THE ID). Extract the PCs or factor scores.
  4. Use bind_cols() to add the ID variable to the data containing factor scores.
  5. Then use left_join(original_data, factor_score_data) to join the factor scores back onto the original data, using the ID variable as the joining key.
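
The steps of Method 1 can be sketched as follows. This is a hedged illustration rather than the analysis from this chapter: it assumes the built-in airquality data, an arbitrary set of four variables, PCA on the correlation matrix (cor = TRUE), and keeping only the first two components. All object names are hypothetical.

```r
library(dplyr)

# 1. Create an ID column on the original data
dat <- airquality %>% mutate(id = 1:NROW(airquality))

# 2. Keep the ID and the analysis variables, then drop incomplete rows
cc <- dat %>% select(id, Ozone, Solar.R, Wind, Temp) %>% na.omit()

# 3. PCA on the complete cases, MINUS THE ID; extract the first two PCs
pc <- princomp(cc %>% select(-id), cor = TRUE)
scores <- as.data.frame(pc$scores[, 1:2])

# 4. Attach the ID to the scores (rows are in the same order as cc)
scores <- bind_cols(cc %>% select(id), scores)

# 5. Join the scores back onto the original data by ID
dat.with.pcs <- left_join(dat, scores, by = "id")
```

Rows that were dropped in step 2 keep their place in dat.with.pcs and simply have NA in the Comp.1 and Comp.2 columns.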

Method 2) Split the data, analyze one part then concatenate back together. (add rows)

  1. Use the complete.cases() function to create a logical vector indicating whether each row is complete.
  2. Split the data into complete and incomplete rows.
  3. Do the analysis on the complete rows, extracting the PCs/factors.
  4. Add the PC/factor data onto the complete rows using bind_cols().
  5. Then use bind_rows() to put the two parts back together.
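
library(dplyr)

# 1. Logical vector: TRUE for rows with no missing values in the pb variables
cc.idx <- hiv %>% select(starts_with("pb")) %>% complete.cases()

# 2. Split the data into complete and incomplete rows
complete.rows   <- hiv[cc.idx, ]
incomplete.rows <- hiv[!cc.idx, ]

# 3. PCA on the pb variables of the complete rows only; keep the scores
pb <- complete.rows %>% select(starts_with("pb"))
pc.scores <- as.data.frame(princomp(pb)$scores)

# 4. Add the PC scores as new columns on the complete rows
complete.add.pc <- bind_cols(complete.rows, pc.scores)

# 5. Stack the two parts; incomplete rows get NA for the PC columns
hiv.with.pcs <- bind_rows(complete.add.pc, incomplete.rows)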