18.5 Imputation Methods
This section demonstrates each imputation method on the bsi_depress
scale variable from the parental HIV example. To recap, 37% of the data on this variable is missing.
Create an index of the row numbers containing missing values. This index will be used to fill in those missing values with imputed data values.
For demonstration purposes I will also create a copy of the bsi_depress
variable so that the original is not overwritten for each example.
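One way to create this index (a minimal sketch; `hiv` and `miss.dep.idx` are the names used in the code below, but the original setup code is not shown here):
miss.dep.idx <- which(is.na(hiv$bsi_depress)) # row numbers with missing bsi_depress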
18.5.1 Unconditional mean substitution
- Impute all missing data using the mean of observed cases
- Artificially decreases the variance
bsi_depress.ums <- hiv$bsi_depress # copy
complete.case.mean <- mean(hiv$bsi_depress, na.rm=TRUE)
bsi_depress.ums[miss.dep.idx] <- complete.case.mean
Only a single value was used to impute missing data.
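As a quick illustrative check of the variance reduction, compare the spread of the observed values with the mean-substituted vector (the exact standard deviations also appear in the comparison table at the end of this section):
# Imputing a single constant shrinks the standard deviation
c(observed = sd(hiv$bsi_depress, na.rm = TRUE),
  mean.sub = sd(bsi_depress.ums))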
18.5.2 Hot deck imputation
- Impute values by randomly sampling values from observed data.
- Good for categorical data
- Reasonable for MCAR and MAR
- The `hotdeck()` function in the `VIM` package implements this
bsi_depress.hotdeck <- hiv$bsi_depress # copy
hot.deck <- sample(na.omit(hiv$bsi_depress), size = length(miss.dep.idx))
bsi_depress.hotdeck[miss.dep.idx] <- hot.deck
The distribution of imputed values better matches the distribution of the observed data, but the quartiles (Q1, Q3) are shifted slightly lower.
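One illustrative way to see this shift (not part of the original comparison) is to compare the five-number summaries of the observed values and the hot-deck draws:
summary(na.omit(hiv$bsi_depress)) # observed values
summary(hot.deck)                 # values sampled for imputation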
18.5.3 Model based imputation
- Conditional Mean imputation: Use regression on observed variables to estimate missing values
- Predictions only available for cases with no missing covariates
- Imputed value is the model predicted mean \(\hat{\mu}_{Y|X}\)
- Could use the `VIM::regressionImp()` function (a sketch follows the regression example below)
- Predictive Mean Matching (PMM): fills in a value by randomly sampling observed values whose regression-predicted values are closest to the regression-predicted value for the missing case (a minimal sketch follows the regression code below).
- Cross between hot-deck and conditional mean
- Categorical data can be imputed using classification models
- Less biased than mean substitution
- but SEs could be inflated
- Typically used in multivariate imputation (so not shown here)
Model bsi_depress using gender, siblings, and age as predictors in a linear regression.
reg.model <- lm(bsi_depress ~ gender + siblings + age, hiv)
need.imp <- hiv[miss.dep.idx, c("gender", "siblings", "age")]
reg.imp.vals <- predict(reg.model, newdata = need.imp)
bsi_depress.lm <- hiv$bsi_depress # copy
bsi_depress.lm[miss.dep.idx] <- reg.imp.vals
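Although PMM is not included in the comparison below, here is a minimal single-variable sketch of the idea, reusing the same regression model; the donor-pool size of 5 and the helper code are illustrative choices, not the author's implementation:
set.seed(1337)
obs.idx  <- which(!is.na(hiv$bsi_depress))
pred.obs <- predict(reg.model, newdata = hiv[obs.idx, ])
bsi_depress.pmm <- hiv$bsi_depress # copy
bsi_depress.pmm[miss.dep.idx] <- sapply(reg.imp.vals, function(p) {
  if (is.na(p)) return(NA)                          # no prediction if covariates are missing
  donors <- obs.idx[order(abs(pred.obs - p))][1:5]  # 5 observed cases with the closest predictions
  sample(hiv$bsi_depress[donors], 1)                # draw one donor value at random
})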
For the regression imputation, it seems that only values around 0.5 and 0.8 were imputed for bsi_depress. The imputed values don’t quite match the distribution of the observed values. Regression imputation and PMM seem to perform very similarly.
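Alternatively, the `VIM::regressionImp()` function mentioned in the list above wraps this in a single call. A minimal sketch, assuming default arguments (see the VIM documentation for the exact return format):
library(VIM)
hiv.reg <- regressionImp(bsi_depress ~ gender + siblings + age, data = hiv)
summary(hiv.reg$bsi_depress[miss.dep.idx]) # imputed values for the originally missing rows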
18.5.4 Adding a residual
- Impute regression value \(\pm\) a randomly selected residual based on estimated residual variance
- Over the long run, this reduces bias on average
set.seed(1337)
resid.sd <- summary(reg.model)$sigma # residual standard deviation (sigma is already an SD; no sqrt needed)
eps <- rnorm(length(miss.dep.idx), mean = 0, sd = resid.sd)
bsi_depress.lm.resid <- hiv$bsi_depress # copy
bsi_depress.lm.resid[miss.dep.idx] <- reg.imp.vals + eps
The distribution of imputed values is spread out a bit more, but the imputations do not respect the truncation at 0 that bsi_depress has; negative values can be imputed.
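An illustrative check of how many imputed values fall below the scale minimum of 0 (not part of the original analysis):
sum(bsi_depress.lm.resid[miss.dep.idx] < 0, na.rm = TRUE) # imputed values below the lower bound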
18.5.5 Comparison of Estimates
Create a table and plot that compare the point estimates and confidence intervals for the average BSI depression score under each method.
single.imp <- bind_rows(
data.frame(value = na.omit(hiv$bsi_depress), method = "Observed"),
data.frame(value = bsi_depress.ums, method = "Mean Sub"),
data.frame(value = bsi_depress.hotdeck, method = "Hot Deck"),
data.frame(value = bsi_depress.lm, method = "Regression"),
data.frame(value = bsi_depress.lm.resid, method = "Reg + eps"))
single.imp$method <- forcats::fct_relevel(single.imp$method ,
c("Observed", "Mean Sub", "Hot Deck", "Regression", "Reg + eps"))
si.ss <- single.imp %>%
group_by(method) %>%
summarize(mean = mean(value),
sd = sd(value),
se = sd/sqrt(n()),
cil = mean-1.96*se,
ciu = mean+1.96*se)
si.ss
## # A tibble: 5 × 6
## method mean sd se cil ciu
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Observed 0.723 0.782 0.0622 0.601 0.844
## 2 Mean Sub 0.723 0.620 0.0391 0.646 0.799
## 3 Hot Deck 0.738 0.783 0.0494 0.641 0.835
## 4 Regression 0.682 0.631 0.0399 0.604 0.760
## 5 Reg + eps 0.753 0.848 0.0536 0.648 0.858
ggplot(si.ss, aes(x=mean, y = method, col=method)) +
geom_point() + geom_errorbar(aes(xmin=cil, xmax=ciu), width=0.2) +
scale_x_continuous(limits=c(.5, 1)) +
theme_bw() + xlab("Average BSI Depression score") + ylab("")
…but we can do better.