1.17 Missing data
1.17.1 Identifying missing data
In Excel, missing data can show up as a blank cell. In SPSS it is represented as a period (.). R displays missing data as NA values.
Missing Data in SPSS: https://stats.idre.ucla.edu/spss/modules/missing-data/
Why would data be missing? Other than the obvious data entry errors, tech glitches or just non-cooperative plants or people, sometimes values are out of range and you would rather delete them than change their value (data edit).
Let's look at the religion variable in the depression data set.
Looking at the codebook, there is no category 6 for religion. Let's change all values of 6 to NA.
This code says take all rows where relig is equal to 6, and change them to NA.
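A minimal sketch of such a recode, assuming the data frame is named depress. The data frame name and the toy values are assumptions standing in for the real depression data.

```r
# Hypothetical stand-in for the depression data; only relig is included
depress <- data.frame(relig = c(1, 2, 6, 4, 3, 6, 1, 2))

# Take all rows where relig equals 6 and set them to NA
depress$relig[depress$relig == 6] <- NA

depress$relig
```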
Confirm recode.
Notice the use of the useNA="always" argument. If we just looked at the base table without this argument, we would never have known there was missing data!
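To see the difference the argument makes, here is a small sketch on made-up values (the depress data frame below is a toy stand-in, not the real data set). It compares the default table with one that requests a count of missing values.

```r
# Hypothetical stand-in: relig after the recode, with two missing values
depress <- data.frame(relig = c(1, 2, NA, 4, 3, NA, 1, 2))

# Default table: missing values are silently left out of the counts
tab_default <- table(depress$relig)

# useNA = "always" adds an <NA> column, even when its count is zero
tab_with_na <- table(depress$relig, useNA = "always")

tab_default
tab_with_na
```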
What about continuous variables? Well, there happens to be no other missing data in this data set, so let's make up a set of 7 data points stored in a variable named y.
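One vector that fits this description is sketched below. The exact values and their order are an assumption, chosen so that there are 7 points, 2 of them missing, with a complete case mean of 3.2.

```r
# Assumed values: 7 data points, 2 of which are missing
y <- c(1, 2, 3, NA, 4, NA, 6)

length(y)        # number of data points, including the missing ones
sum(is.na(y))    # number of missing values
```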
The #1 way to identify missing data in a continuous variable is by looking at the summary() values.
mean(y)
## [1] NA
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     1.0     2.0     3.0     3.2     4.0     6.0       2
mean(y, na.rm=TRUE)
## [1] 3.2
In R, any arithmetic function (like addition or multiplication) on missing data results in a missing value. The na.rm=TRUE toggle tells R to drop the missing values first and calculate the complete case mean. This can be a biased estimate of the mean when the values are not missing completely at random, but missing data is a topic worthy of its own course and is introduced in Chapter 18.
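A short sketch of how NA propagates through arithmetic, using a made-up vector v (the name and values are for illustration only):

```r
# Hypothetical vector with one missing value
v <- c(2, 4, NA)

v + 1                   # the missing element stays missing
sum(v)                  # one NA makes the whole sum NA
sum(v, na.rm = TRUE)    # drop the NA first, then add

# na.rm = TRUE gives the same answer as removing the NAs by hand
cc_mean <- mean(v, na.rm = TRUE)
by_hand <- mean(v[!is.na(v)])
```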
1.17.2 Model predictions
Situation: You want to add model predictions to the data set, but you have missing data that was automatically dropped prior to analysis.
1.17.3 Regression
R objects created by methods such as lm and glm store the data actually used to fit the model (the model frame, with incomplete rows already dropped) in the model object itself, in model$model. See Chapter 12 for an example.
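A minimal sketch of two ways to line predictions up with the data, using a made-up data frame dat (all names and values here are hypothetical). The first pulls the model frame out of the fitted object. The second is an alternative not described above: fitting with na.action = na.exclude, so that fitted() pads the dropped rows with NA.

```r
# Toy data with one missing predictor value (hypothetical)
dat <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
)

# Option 1: lm() drops the incomplete row; the rows it kept are in fit$model
fit  <- lm(y ~ x, data = dat)
used <- fit$model
used$pred <- fitted(fit)      # same length as the rows actually used

# Option 2: na.exclude makes fitted() return NA for the dropped row,
# so the predictions match the original data row for row
fit2 <- lm(y ~ x, data = dat, na.action = na.exclude)
dat$pred <- fitted(fit2)
```

Option 1 gives a smaller, complete data set; option 2 keeps every original row and marks the rows that could not be predicted.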
1.17.4 Factor Analysis and Principal Components
If your original data had missing values, here are two ways to get the PCs / factor scores for available data back onto the data set.
Method 1) Create an ID column and merge new variables onto original data. (add columns)
- If no ID column exists, create one on the original dataset: id = 1:NROW(data)
- Use select() to extract the ID and all variables used in the factor analysis, then do a na.omit() to drop rows with any missing data. Save this as a new complete case data set.
- Conduct PCA / Factor analysis on this new complete case data set (MINUS THE ID). Extract the PCs or factor scores.
- Use bind_cols() to add the ID variable to the data containing factor scores.
- Then left_join(original_data, factor_score_data) the factor scores back to the original data, using the ID variable as the joining key.
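The steps of Method 1 can be sketched as follows. This assumes dplyr is installed, uses prcomp() as one possible PCA function, and runs on a made-up data frame dat. The variable names, the toy values, and the choice to keep two components are all assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data: three analysis variables plus one other column
set.seed(1)
dat <- data.frame(
  x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
  group = rep(c("g1", "g2"), 5)
)
dat$x1[2] <- NA
dat$x3[7] <- NA

# 1. Create an ID column on the original data
dat$id <- 1:NROW(dat)

# 2. Keep the ID and the analysis variables, then drop incomplete rows
cc <- dat %>% select(id, x1, x2, x3) %>% na.omit()

# 3. Run the PCA on the complete cases, leaving the ID out
pc <- prcomp(cc %>% select(-id), scale. = TRUE)
scores <- as.data.frame(pc$x[, 1:2])      # scores on the first two PCs

# 4. Attach the ID to the scores
scores <- bind_cols(cc %>% select(id), scores)

# 5. Join back to the original data; rows dropped in step 2 get NA scores
out <- left_join(dat, scores, by = "id")
```

Because the join is keyed on id, the original row order and all original columns are preserved.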
Method 2) Split the data, analyze one part then concatenate back together. (add rows)
- Use the complete.cases() function to create a boolean vector for if each row is complete.
- Split the data into complete and incomplete.
- Do the analysis on the complete rows, extracting the PCs/Factors.
- Add the PC/Factor data onto the complete rows using bind_cols.
- Then bind_rows the two parts back together.
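Method 2 can be sketched in the same way. As before, this assumes dplyr is installed, uses prcomp() as one possible PCA function, and runs on a made-up data frame dat whose names and values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data containing only the analysis variables
set.seed(2)
dat <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
dat$x2[3] <- NA
dat$x3[8] <- NA

# 1. Boolean vector: TRUE where a row has no missing values
is_complete <- complete.cases(dat)

# 2. Split the data into complete and incomplete rows
complete   <- dat[is_complete, ]
incomplete <- dat[!is_complete, ]

# 3. Do the analysis on the complete rows and extract the PC scores
pc <- prcomp(complete, scale. = TRUE)
scores <- as.data.frame(pc$x[, 1:2])

# 4. Add the scores onto the complete rows
complete <- bind_cols(complete, scores)

# 5. Stack the two parts; bind_rows fills the missing score columns with NA
out <- bind_rows(complete, incomplete)
```

Note that the incomplete rows end up at the bottom of out, so the original row order is not preserved. If the data also hold variables that are not part of the analysis, complete.cases() should be applied to the analysis variables only.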