## 18.5 Imputation Methods

This section demonstrates each imputation method on the bsi_depress scale variable from the parental HIV example. To recap, 37% of the data on this variable is missing.

Create an index of row numbers containing missing values. This will be used to fill in those missing values with a data value.
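A minimal sketch of this indexing step is shown below. It uses a small made-up vector, `toy`, in place of the hiv data, so the names and values are assumptions for illustration only.

```r
# Toy stand-in for hiv$bsi_depress (values are made up for illustration)
toy <- c(0.5, NA, 1.2, 0.0, NA, 0.8)

# Row positions of the missing values
miss.idx <- which(is.na(toy))
miss.idx

# Proportion of values that are missing
mean(is.na(toy))
```

Applying the same pattern to hiv$bsi_depress would presumably give the miss.dep.idx vector used in the code that follows.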

bsi_depress.hotdeck <- hiv$bsi_depress # copy
hot.deck <- sample(na.omit(hiv$bsi_depress), size = length(miss.dep.idx))
bsi_depress.hotdeck[miss.dep.idx] <- hot.deck

The distribution of imputed values better matches the distribution of observed data, but the quartiles (Q1, Q3) are shifted slightly lower.

### 18.5.3 Model based imputation

• Conditional Mean imputation: Use regression on observed variables to estimate missing values
  • Predictions only available for cases with no missing covariates
  • Imputed value is the model predicted mean $$\hat{\mu}_{Y|X}$$
  • Could use VIM::regressionImp() function
• Predictive Mean Matching: Fills in a value randomly by sampling observed values whose regression-predicted values are closest to the regression-predicted value for the missing point.
  • Cross between hot-deck and conditional mean
• Categorical data can be imputed using classification models
• Less biased than mean substitution
  • but SE’s could be inflated
• Typically used in multivariate imputation (so not shown here)

Model bsi_depress using gender, siblings and age as predictors in a linear regression.

reg.model <- lm(bsi_depress ~ gender + siblings + age, hiv)
need.imp <- hiv[miss.dep.idx, c("gender", "siblings", "age")]
reg.imp.vals <- predict(reg.model, newdata = need.imp)

bsi_depress.lm <- hiv$bsi_depress # copy
bsi_depress.lm[miss.dep.idx] <- reg.imp.vals

It seems that the imputed values for bsi_depress all fall around 0.5 and 0.8, so they do not match the distribution of the observed values well. Regression imputation and PMM seem to perform very similarly.
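Predictive mean matching is described in this section but not demonstrated, so here is a rough single-variable sketch of the idea. Everything in it is an assumption for illustration: the data are simulated rather than taken from hiv, and the donor pool size `k = 5` is an arbitrary choice.

```r
set.seed(42)

# Simulated data (an assumption for illustration, not the hiv data)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.7)
y[sample(n, 60)] <- NA            # make 60 outcomes missing
dat <- data.frame(x = x, y = y)

miss.idx <- which(is.na(dat$y))
obs.idx  <- which(!is.na(dat$y))

# Fit on complete cases (lm drops rows with NA), then predict for every row
fit  <- lm(y ~ x, data = dat)
pred <- predict(fit, newdata = dat)

# For each missing case, find the k observed cases whose predicted values
# are closest (the donors) and copy one donor's observed y at random
k <- 5
y.pmm <- dat$y
for (i in miss.idx) {
  d      <- abs(pred[obs.idx] - pred[i])
  donors <- obs.idx[order(d)[1:k]]
  y.pmm[i] <- dat$y[sample(donors, 1)]
}

summary(y.pmm)
```

Because every imputed value is copied from an observed case, the imputations stay within the range of the observed data. Packaged implementations such as the "pmm" method in mice add refinements that this sketch leaves out.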

• Impute the regression-predicted value $$\pm$$ a randomly drawn residual, based on the estimated residual variance
• In the long run, this reduces bias on average
set.seed(1337)
rmse <- summary(reg.model)$sigma # residual standard error; already on the SD scale
eps <- rnorm(length(miss.dep.idx), mean=0, sd=rmse)
bsi_depress.lm.resid <- hiv$bsi_depress # copy
bsi_depress.lm.resid[miss.dep.idx] <- reg.imp.vals + eps

The distribution of imputed values is now spread out a bit more, but the imputations do not respect the lower bound of 0 that bsi_depress has.
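Both points can be checked on simulated data. The sketch below is an illustration under assumed values (a made-up non-negative outcome and a single predictor), not a rerun of the hiv analysis. It compares the spread of imputed values with and without the added residual, then counts how many imputations fall below 0.

```r
set.seed(1337)

# Simulated non-negative outcome (an assumption for illustration)
n <- 300
x <- rnorm(n)
y <- pmax(0, 0.7 + 0.4 * x + rnorm(n, sd = 0.8))
miss.idx <- sample(n, 100)
y[miss.idx] <- NA
dat <- data.frame(x = x, y = y)

fit  <- lm(y ~ x, data = dat)
pred <- predict(fit, newdata = dat[miss.idx, , drop = FALSE])

# Conditional mean imputation vs. conditional mean plus a random residual
eps       <- rnorm(length(miss.idx), mean = 0, sd = summary(fit)$sigma)
imp.mean  <- pred
imp.resid <- pred + eps

# Spread of the imputed values under each approach, next to the observed spread
c(observed   = sd(y, na.rm = TRUE),
  mean.only  = sd(imp.mean),
  with.resid = sd(imp.resid))

# Number of imputed values that fall below the lower bound of 0
sum(imp.resid < 0)
```

The residual draw restores variability that conditional mean imputation removes, but a normal residual has no notion of the scale's lower bound, so some draws can land below 0.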

### 18.5.5 Comparison of Estimates

Create a table and plot that compares the point estimates and intervals for the average bsi depression scale.

single.imp <- bind_rows(
  data.frame(value = na.omit(hiv$bsi_depress), method = "Observed"),
  data.frame(value = bsi_depress.ums, method = "Mean Sub"),
  data.frame(value = bsi_depress.hotdeck, method = "Hot Deck"),
  data.frame(value = bsi_depress.lm, method = "Regression"),
  data.frame(value = bsi_depress.lm.resid, method = "Reg + eps"))

single.imp$method <- forcats::fct_relevel(single.imp$method,
  c("Observed", "Mean Sub", "Hot Deck", "Regression", "Reg + eps"))

si.ss <- single.imp %>%
group_by(method) %>%
summarize(mean = mean(value),
sd = sd(value),
se = sd/sqrt(n()),
cil = mean-1.96*se,
ciu = mean+1.96*se)
si.ss
## # A tibble: 5 × 6
##   method      mean    sd     se   cil   ciu
##   <fct>      <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 Observed   0.723 0.782 0.0622 0.601 0.844
## 2 Mean Sub   0.723 0.620 0.0391 0.646 0.799
## 3 Hot Deck   0.738 0.783 0.0494 0.641 0.835
## 4 Regression 0.682 0.631 0.0399 0.604 0.760
## 5 Reg + eps  0.753 0.848 0.0536 0.648 0.858
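Since se = sd/sqrt(n), the printed table can be used to back out the sample size behind each row. This helps explain why the mean substitution interval is the narrowest: its sd is smaller and its n is larger than for the observed data. The sketch below copies the rounded values from the table above, so the implied n is only approximate.

```r
# Values copied from the printed table above (already rounded)
tab <- data.frame(
  method = c("Observed", "Mean Sub", "Hot Deck", "Regression", "Reg + eps"),
  sd     = c(0.782, 0.620, 0.783, 0.631, 0.848),
  se     = c(0.0622, 0.0391, 0.0494, 0.0399, 0.0536)
)

# se = sd / sqrt(n)  implies  n = (sd / se)^2; approximate because of rounding
tab$n.implied <- (tab$sd / tab$se)^2
tab

# Share of rows that were observed: Observed n relative to the imputed n
obs.share <- tab$n.implied[1] / tab$n.implied[2]
obs.share
```

The ratio comes out at about 0.63, in line with the 37% missingness noted at the start of this section.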
ggviolin(single.imp, y = "value",
fill = "method", x = "method",
alpha = .2)

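# Means with 95% intervals by method (the point and error-bar layers are assumed; the start of this call is missing)
ggplot(si.ss, aes(x = mean, y = method)) +
geom_point() + geom_errorbar(aes(xmin = cil, xmax = ciu), width = 0.2) +
theme_bw() + xlab("Average BSI Depression score") + ylab("")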