1.6 Dealing with missing data post-analysis
Situation: You want to add model predictions to the data set, but you have missing data that was automatically dropped prior to analysis.
1.6.1 Regression
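R objects created by methods such as lm and glm store the data actually used to fit the model (the rows left after incomplete cases are dropped) in the model object itself, in model$model. The row names of model$model match the row names of the original data, so they can be used to line predictions up with the original rows. See Chapter 12 for an example.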
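As a minimal sketch of this idea, the code below fits a model to the built-in airquality data (chosen here only as a stand-in because it contains missing values; the names dat, fit, and pred are illustrative) and uses the row names of the stored model frame to put the fitted values back on the rows that were used.

```r
# Illustrative data: airquality ships with base R and has NAs in Ozone and Solar.R
dat <- airquality

# lm() silently drops rows with missing values in any model variable
fit <- lm(Ozone ~ Solar.R + Wind, data = dat)

# The model frame holds only the rows that were actually used
used <- fit$model

# Row names of the model frame are the original row positions,
# so they can index back into the full data set
dat$pred <- NA_real_
dat$pred[as.integer(rownames(used))] <- fitted(fit)

# Rows that were dropped before fitting keep NA in the new column
head(dat)
```

This relies on the data having the default integer row names. Another option is to fit the model with na.action = na.exclude, in which case fitted() and predict() return vectors padded with NA to the original number of rows.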
1.6.2 Factor Analysis and Principal Components
If your original data had missing values, here are two ways to get the PCs / factor scores for the available data back onto the data set.
Method 1) Create an ID column and merge new variables onto original data. (add columns)
- If no ID column exists, create one on the original data set: id = 1:NROW(data)
- Use select() to extract the ID and all variables used in the factor analysis, then use na.omit() to drop rows with any missing data. Save this as a new complete case data set.
- Conduct PCA / factor analysis on this new complete case data set (MINUS THE ID). Extract the PCs or factor scores.
- Use bind_cols() to add the ID variable to the data containing the factor scores.
- Then use left_join(original_data, factor_score_data) to bring the factor scores back onto the original data, using the ID variable as the joining key.
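The steps of Method 1 can be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in airquality data, prcomp() as the PCA routine, and hypothetical object names (dat, cc, scores, dat_scored); substitute your own data and analysis function.

```r
library(dplyr)

# Illustrative data: airquality has NAs in Ozone and Solar.R
# Step 1: create an ID column on the original data
dat <- airquality %>% mutate(id = 1:NROW(airquality))

# Step 2: keep the ID plus the analysis variables, then drop incomplete rows
cc <- dat %>%
  select(id, Ozone, Solar.R, Wind, Temp) %>%
  na.omit()

# Step 3: PCA on the complete cases, minus the ID
pca <- prcomp(select(cc, -id), scale. = TRUE)

# Step 4: attach the ID to the scores (rows are in the same order as cc)
scores <- bind_cols(select(cc, id), as.data.frame(pca$x))

# Step 5: join the scores back onto the original data by ID;
# rows that were dropped in step 2 get NA for every PC column
dat_scored <- left_join(dat, scores, by = "id")
```

Because left_join() keeps every row of the left-hand table, dat_scored has the same number of rows, in the same order, as the original data.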
Method 2) Split the data, analyze one part then concatenate back together. (add rows)
- Use the complete.cases() function to create a boolean vector indicating whether each row is complete.
- Split the data into complete and incomplete rows.
- Do the analysis on the complete rows, extracting the PCs / factors.
- Add the PC / factor data onto the complete rows using bind_cols().
- Then use bind_rows() to put the two parts back together.
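A sketch of Method 2, again using the built-in airquality data and prcomp() purely for illustration. The object names are hypothetical, and checking completeness only on the analysis variables (rather than on every column) is an assumption made here.

```r
library(dplyr)

# Illustrative choice of analysis variables
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# TRUE for rows with no missing values in the analysis variables
is_complete <- complete.cases(airquality[, vars])

# Split the data into complete and incomplete rows
complete_rows   <- airquality[is_complete, ]
incomplete_rows <- airquality[!is_complete, ]

# Analysis on the complete rows only
pca <- prcomp(complete_rows[, vars], scale. = TRUE)

# Add the PC scores as new columns on the complete rows
complete_scored <- bind_cols(complete_rows, as.data.frame(pca$x))

# Stack the two parts; bind_rows() fills the PC columns with NA
# for the incomplete rows, which have no such columns
dat_scored <- bind_rows(complete_scored, incomplete_rows)
```

Note that the result is no longer in the original row order: all complete rows come first, followed by the incomplete ones. If the order matters, add an ID column before splitting and sort by it at the end.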