In Excel, missing data can show up as a blank cell. In SPSS it is represented as a period (.). R displays missing data as NA.
Missing Data in SPSS: https://stats.idre.ucla.edu/spss/modules/missing-data/
Why would data be missing? Other than the obvious data entry errors, tech glitches or just non-cooperative plants or people, sometimes values are out of range and you would rather delete them than change their value (data edit).
Let's look at the religion variable in the depression data set.
Looking at the codebook, there is no category 6 for religion. Let's change all values of 6 to NA.
This code says: take all rows where relig is equal to 6, and change them to NA.
Notice the use of the useNA="always" argument. If we had just looked at the base table without this argument, we would never have known there was missing data!
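The recode and the two tables can be sketched as follows. This is a minimal illustration, assuming a small made-up data frame named depress with a single relig column; the values are hypothetical and are not the real depression data.

```r
# Hypothetical stand-in for the depression data; only relig is included.
depress <- data.frame(relig = c(1, 2, 6, 4, 1, 6, 3))

# Take all rows where relig equals 6 and set them to missing.
depress$relig[depress$relig == 6] <- NA

# Without useNA, table() silently leaves the missing values out.
table(depress$relig)

# With useNA = "always", the missing values get their own column.
table(depress$relig, useNA = "always")
```

Comparing the two tables side by side shows why the argument matters: the counts in the first table sum to fewer rows than the data set has.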
What about continuous variables? Well there happens to be no other missing data in this data set, so let’s make up a set of 7 data points stored in a variable named
The #1 way to identify missing data in a continuous variable is by looking at the
In R, any arithmetic function (like addition or multiplication) on missing data results in a missing value. The na.rm=TRUE toggle tells R to calculate the complete case mean. This can be a biased estimate of the mean, but missing data is a topic worthy of its own course and is introduced in Chapter 18.
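A small sketch of this behavior, assuming a made-up vector of 7 values with two of them missing. The name z and the values are hypothetical, and summary() is shown as one common way to spot missing values.

```r
# Seven made-up data points, two of them missing (hypothetical name z).
z <- c(5, 8, NA, 3, 10, NA, 4)

summary(z)             # the summary includes a count of NA's
sum(is.na(z))          # number of missing values

mean(z)                # NA: arithmetic on missing data returns missing
mean(z, na.rm = TRUE)  # mean of the 5 observed values only
```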
Situation: You want to add model predictions to the data set, but you have missing data that was automatically dropped prior to analysis.
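R objects created by methods such as glm store data in the model object itself: model$data holds the data frame that was passed to the data argument, and model$model holds the model frame, that is, the rows and variables actually used in the fit. See Chapter 12 for an example.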
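As a hedged sketch of this situation, the example below fits a glm on made-up data with missing predictor values. It uses na.action = na.exclude, which is one base R option not described in the text above: the incomplete rows are still dropped for fitting, but fitted() pads them with NA so the predictions line up with the original rows. The data frame dat and its columns are hypothetical.

```r
# Hypothetical data: binary outcome y, predictor x with two missing values.
dat <- data.frame(
  y = c(0, 0, 1, 1, 0, 1, 1, 0),
  x = c(1.2, NA, 3.1, 2.8, 0.5, NA, 1.9, 2.2)
)

# na.exclude drops incomplete rows for fitting but remembers where they were.
fit <- glm(y ~ x, data = dat, family = binomial, na.action = na.exclude)

nrow(fit$model)  # rows actually used in the fit (complete cases)
nrow(fit$data)   # rows in the data frame that was passed to glm()

# fitted() pads the dropped rows with NA, so it can be added as a column.
dat$pred <- fitted(fit)
```

With the default na.action (na.omit), fitted() would return only as many values as there are complete rows, and the assignment to dat$pred would fail because the lengths differ.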
If your original data had missing values, here are two ways to get the PCs / factor scores for the available data back onto the data set.
Method 1) Create an ID column and merge new variables onto original data. (add columns)
- If no ID column exists, create one on the original dataset: id = 1:NROW(data)
- Use select() to extract the ID and all variables used in the factor analysis, then do a na.omit() to drop rows with any missing data. Save this as a new complete case data set.
- Conduct PCA / Factor analysis on this new complete case data set (MINUS THE ID). Extract the PCs or factor scores.
- Use bind_cols() to add the ID variable to the data containing factor scores.
- Use left_join(original_data, factor_score_data) to join the factor scores back to the original data, using the ID variable as the joining key.
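The steps of Method 1 can be sketched as follows. This is an illustration under stated assumptions: the data frame and its columns x1 to x3 are made up, prcomp() stands in for whichever PCA / factor analysis function you actually use, and keeping the first two PCs is an arbitrary choice.

```r
library(dplyr)

# Hypothetical data with scattered missing values.
original_data <- data.frame(
  x1 = c(2.1, 3.4, NA, 5.0, 4.2, 1.8),
  x2 = c(1.0, 2.2, 3.1, NA, 4.0, 0.9),
  x3 = c(5.5, 6.1, 7.0, 8.2, 7.7, 5.0)
)

# 1. Create an ID column on the original data.
original_data$id <- 1:NROW(original_data)

# 2. Keep the ID and the analysis variables, then drop incomplete rows.
complete_data <- original_data %>%
  select(id, x1, x2, x3) %>%
  na.omit()

# 3. Run the PCA on the complete cases, minus the ID, and extract scores.
pca <- prcomp(select(complete_data, -id), scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])

# 4. Attach the ID to the scores.
factor_score_data <- bind_cols(select(complete_data, id), scores)

# 5. Join the scores back; rows dropped in step 2 get NA scores.
merged <- left_join(original_data, factor_score_data, by = "id")
```

The result has the same number of rows, in the same order, as the original data, with NA in the score columns for any row that had missing values.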
Method 2) Split the data, analyze one part then concatenate back together. (add rows)
- Use the complete.cases() function to create a boolean vector indicating whether each row is complete.
- Split the data into complete and incomplete rows.
- Do the analysis on the complete rows, extracting the PCs/Factors.
- Add the PC/Factor data onto the complete rows using bind_cols().
- Use bind_rows() to stack the two parts back together.
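Method 2 can be sketched along the same lines. As before, the data frame is made up, prcomp() is only a stand-in for your actual analysis, and the choice of two PCs is arbitrary.

```r
library(dplyr)

# Hypothetical data with scattered missing values.
original_data <- data.frame(
  x1 = c(2.1, 3.4, NA, 5.0, 4.2, 1.8),
  x2 = c(1.0, 2.2, 3.1, NA, 4.0, 0.9),
  x3 = c(5.5, 6.1, 7.0, 8.2, 7.7, 5.0)
)

# 1. Boolean vector: TRUE where a row has no missing values.
is_complete <- complete.cases(original_data)

# 2. Split the data into complete and incomplete rows.
complete_rows   <- original_data[is_complete, ]
incomplete_rows <- original_data[!is_complete, ]

# 3. Run the PCA on the complete rows and extract the scores.
pca <- prcomp(complete_rows, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])

# 4. Add the score columns onto the complete rows.
complete_rows <- bind_cols(complete_rows, scores)

# 5. Stack the two parts; bind_rows() fills the missing score columns with NA.
recombined <- bind_rows(complete_rows, incomplete_rows)
```

Note that this approach changes the row order: the complete rows come first, followed by the incomplete ones. If the original order matters, Method 1 or an explicit ID column is the safer choice.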