18.6 Multiple Imputation (MI)

18.6.1 Goals

  • Accurately reflect available information
  • Avoid bias in estimates of quantities of interest
  • Estimation could involve explicit or implicit model
  • Accurately reflect uncertainty due to missingness

18.6.2 Technique

  1. For each missing value, impute \(m\) estimates (usually \(m\) = 5)
    • Imputation method must include a random component
  2. Create \(m\) complete data sets
  3. Perform desired analysis on each of the \(m\) complete data sets
  4. Pool final estimates in a manner that accounts for the between, and within imputation variance.
Diagram of Multiple Imputation process. Credit: https://stefvanbuuren.name/fimd/sec-nutshell.html
Diagram of Multiple Imputation process. Credit: https://stefvanbuuren.name/fimd/sec-nutshell.html

18.6.3 MI as a paradigm

  • Logic: “Average over” uncertainty, don’t assume most likely scenario (single imputation) covers all plausible scenarios
  • Principle: Want nominal 95% intervals to cover targets of estimation 95% of the time
  • Simulation studies show that, when MAR assumption holds:
    • Proper imputations will yield close to nominal coverage (Rubin 87)
    • Improvement over single imputation is meaningful
    • Number of imputations can be modest - even 2 adequate for many purposes, so 5 is plenty

Rubin 87: Multiple Imputation for Nonresponse in Surveys, Wiley, 1987).

18.6.4 Inference on MI (Pooling estimates)

Consider \(m\) imputed data sets. For some quantity of interest \(Q\) with squared \(SE = U\), calculate \(Q_{1}, Q_{2}, \ldots, Q_{m}\) and \(U_{1}, U_{2}, \ldots, U_{m}\) (e.g., carry out \(m\) regression analyses, obtain point estimates and SE from each).

Then calculate the average estimate \(\bar{Q}\), the average variance \(\bar{U}\), and the variance of the averages \(B\).

\[ \begin{aligned} \bar{Q} & = \sum^{m}_{i=1}Q_{i}/m \\ \bar{U} & = \sum^{m}_{i=1}U_{i}/m \\ B & = \frac{1}{m-1}\sum^{m}_{i=1}(Q_{i}-\bar{Q})^2 \end{aligned} \]

Then \(T = \bar{U} + \frac{m+1}{m}B\) is the estimated total variance of \(\bar{Q}\).

Significance tests and interval estimates can be based on

\[\frac{\bar{Q}-Q}{\sqrt{T}} \sim t_{df}, \mbox{ where } df = (m-1)(1+\frac{1}{m+1}\frac{\bar{U}}{B})^2\]

  • df are similar to those for comparison of normal means with unequal variances, i.e., using Satterthwaite approximation.
  • Ratio of (B = between-imputation variance) to (T = between + within-imputation variance) is known as the fraction of missing information (FMI).
    • The FMI has been proposed as a way to monitor ongoing data collection and estimate the potential bias resulting from survey non-responders Wagner, 2018

18.6.5 Example

  1. Create \(m\) imputed datasets using linear regression plus a small amount of random noise so all the imputed values are not identical
set.seed(1061)
dep.imp1 <- dep.imp2 <- dep.imp3 <- regressionImp(bsi_depress ~ gender + siblings + age, hiv) 
dep.imp1$bsi_depress[miss.dep.idx] <- dep.imp1$bsi_depress[miss.dep.idx] +
  rnorm(length(miss.dep.idx), mean=0, sd=rmse/2)

dep.imp2$bsi_depress[miss.dep.idx] <- dep.imp2$bsi_depress[miss.dep.idx] + 
  rnorm(length(miss.dep.idx), mean=0, sd=rmse/2)

dep.imp3$bsi_depress[miss.dep.idx] <- dep.imp3$bsi_depress[miss.dep.idx] + 
  rnorm(length(miss.dep.idx), mean=0, sd=rmse/2)

Visualize the distributions of observed and imputed

dep.mi <- bind_rows(
  data.frame(value = dep.imp1$bsi_depress, imputed = dep.imp1$bsi_depress_imp, 
             imp = "dep.imp1"), 
  data.frame(value = dep.imp2$bsi_depress, imputed = dep.imp2$bsi_depress_imp, 
             imp ="dep.imp2"), 
  data.frame(value = dep.imp3$bsi_depress, imputed = dep.imp3$bsi_depress_imp, 
             imp ="dep.imp3"))

ggdensity(dep.mi, x = "value", color = "imputed", fill = "imputed", 
          add = "mean", rug=TRUE, palette = "jco") + 
  facet_wrap(~imp, ncol=1)

  1. Calculate the point estimate \(Q\) and the variance \(U\) from each imputation.
(Q <- c(mean(dep.imp1$bsi_depress), 
        mean(dep.imp2$bsi_depress), 
        mean(dep.imp3$bsi_depress)))
## [1] 0.6700139 0.6808710 0.6694280

n.d <- length(dep.imp1$bsi_depress)
(U <- c(sd(dep.imp1$bsi_depress)/sqrt(n.d), 
        sd(dep.imp2$bsi_depress)/sqrt(n.d), 
        sd(dep.imp3$bsi_depress)/sqrt(n.d)))
## [1] 0.04443704 0.04324317 0.04365866
  1. Pool estimates and calculate a 95% CI
Q.bar <- mean(Q)          # average estimate
U.bar <- mean(U)           # average variance
B <- sd(Q)                 # variance of averages
Tv <- U.bar + ((3+1)/3)*B  # Total variance of estimate

df <- 2*(1+(U.bar/(4*B))^2) # degress of freedom
t95 <- qt(.975, df) # critical value for 95% CI

mi.ss <- data.frame(
  method = "MI Reg", 
  mean = Q.bar, 
  se = sqrt(Tv), 
  cil = Q.bar - t95*sqrt(Tv),
  ciu = Q.bar + t95*sqrt(Tv))

(imp.ss <- bind_rows(si.ss, mi.ss))
## # A tibble: 6 × 6
##   method      mean     sd     se   cil   ciu
##   <chr>      <dbl>  <dbl>  <dbl> <dbl> <dbl>
## 1 Observed   0.723  0.782 0.0622 0.601 0.844
## 2 Mean Sub   0.723  0.620 0.0391 0.646 0.799
## 3 Hot Deck   0.738  0.783 0.0494 0.641 0.835
## 4 Regression 0.682  0.631 0.0399 0.604 0.760
## 5 Reg + eps  0.753  0.848 0.0536 0.648 0.858
## 6 MI Reg     0.673 NA     0.229  0.143 1.20
ggplot(imp.ss, aes(x=mean, y = method, col=method)) + 
  geom_point() + geom_errorbar(aes(xmin=cil, xmax=ciu), width=0.2) + 
  scale_x_continuous(limits=c(-.3, 2)) + 
  theme_bw() + xlab("Average BSI Depression score") + ylab("")