18.6 Multiple Imputation (MI)

18.6.1 Goals

  • Accurately reflect available information
  • Avoid bias in estimates of quantities of interest
  • Estimation could involve an explicit or implicit model
  • Accurately reflect uncertainty due to missingness

18.6.2 Technique

  1. For each missing value, impute \(m\) estimates (usually \(m = 5\))
    • Imputation method must include a random component
  2. Create \(m\) complete data sets
  3. Perform desired analysis on each of the \(m\) complete data sets
  4. Combine the final estimates in a manner that accounts for both the between- and within-imputation variance.
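
The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended imputation model: the quantity of interest is a simple mean, missing values are marked with `None`, the imputation model is a normal distribution fitted to a bootstrap resample of the observed cases, and the names `impute_once`, `analyze`, and `multiple_imputation` and the toy `data` are hypothetical.

```python
import random
import statistics

def impute_once(values, rng):
    """Return a completed copy of `values`; None entries become random draws."""
    observed = [v for v in values if v is not None]
    # Assumption: bootstrap the observed cases so the fitted parameters
    # differ between imputations (a rough stand-in for a proper imputation).
    boot = [rng.choice(observed) for _ in observed]
    mu, sd = statistics.mean(boot), statistics.stdev(boot)
    # Random component: draw each missing value rather than plugging in mu.
    return [v if v is not None else rng.gauss(mu, sd) for v in values]

def analyze(completed):
    """Analysis step: estimate of the mean and its squared standard error."""
    n = len(completed)
    return statistics.mean(completed), statistics.variance(completed) / n

def multiple_imputation(values, m=5, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: create m completed data sets.
    completed_sets = [impute_once(values, rng) for _ in range(m)]
    # Step 3: run the same analysis on each completed data set.
    results = [analyze(d) for d in completed_sets]
    q = [r[0] for r in results]
    u = [r[1] for r in results]
    # Step 4: combine, keeping within- and between-imputation variance.
    q_bar = statistics.mean(q)
    u_bar = statistics.mean(u)      # within-imputation variance
    b = statistics.variance(q)      # between-imputation variance
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total

# Hypothetical toy data with three missing values.
data = [4.1, None, 5.3, 6.0, None, 4.8, 5.5, None, 5.1, 4.6]
estimate, total_variance = multiple_imputation(data, m=5)
print(estimate, total_variance)
```

Replacing the random draw in `impute_once` with the fitted mean would turn this into single (mean) imputation, and the between-imputation variance would collapse to zero.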

Diagram of Multiple Imputation process. Credit: https://stefvanbuuren.name/fimd/sec-nutshell.html

18.6.3 MI as a paradigm

  • Logic: “Average over” uncertainty, don’t assume most likely scenario (single imputation) covers all plausible scenarios
  • Principle: Want nominal 95% intervals to cover targets of estimation 95% of the time
  • Simulation studies show that, when MAR assumption holds:
    • Proper imputations will yield close to nominal coverage (Rubin 87)
    • Improvement over single imputation is meaningful
    • Number of imputations can be modest - even 2 adequate for many purposes, so 5 is plenty
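
One commonly cited justification for a small \(m\), attributed to Rubin (1987), is that the efficiency of an estimate based on \(m\) imputations, relative to one based on infinitely many, is approximately \((1 + \gamma/m)^{-1}\) on the variance scale, where \(\gamma\) is the fraction of missing information. The snippet below simply tabulates that expression; the function name `relative_efficiency` and the value \(\gamma = 0.3\) are illustrative assumptions.

```python
def relative_efficiency(fmi, m):
    """Approximate efficiency of m imputations relative to infinitely many.

    fmi is the fraction of missing information (between 0 and 1).
    """
    return 1.0 / (1.0 + fmi / m)

# Assumed fraction of missing information of 0.3, for illustration only.
for m in (2, 5, 20):
    print(m, round(relative_efficiency(0.3, m), 3))
```

With \(\gamma = 0.3\) this gives roughly 0.87, 0.94, and 0.99 for \(m\) = 2, 5, and 20, so the gain from going beyond a handful of imputations is small unless the fraction of missing information is large.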

Rubin 87: Multiple Imputation for Nonresponse in Surveys (Wiley, 1987).

18.6.4 Inference on MI

Consider \(m\) imputed data sets. For some quantity of interest \(Q\) with squared \(SE = U\), calculate \(Q_{1}, Q_{2}, \ldots, Q_{m}\) and \(U_{1}, U_{2}, \ldots, U_{m}\) (e.g., carry out \(m\) regression analyses, obtain point estimates and SE from each).

Then calculate the average estimate \(\bar{Q}\), the average within-imputation variance \(\bar{U}\), and the between-imputation variance \(B\) (the sample variance of the \(m\) point estimates).

\[ \begin{aligned} \bar{Q} & = \sum^{m}_{i=1}Q_{i}/m \\ \bar{U} & = \sum^{m}_{i=1}U_{i}/m \\ B & = \frac{1}{m-1}\sum^{m}_{i=1}(Q_{i}-\bar{Q})^2 \end{aligned} \]

Then \(T = \bar{U} + \frac{m+1}{m}B\) is the estimated total variance of \(\bar{Q}\).

Significance tests and interval estimates can be based on

\[\frac{\bar{Q}-Q}{\sqrt{T}} \sim t_{df}, \mbox{ where } df = (m-1)\left(1+\frac{m}{m+1}\frac{\bar{U}}{B}\right)^2\]

  • The df are similar to those for a comparison of normal means with unequal variances, i.e., using the Satterthwaite approximation.
  • The ratio of the between-imputation variance \(B\) to the total variance \(T\) (between plus within) is, approximately, the fraction of missing information (FMI).
    • The FMI has been proposed as a way to monitor ongoing data collection and to estimate the potential bias resulting from survey non-responders (Wagner, 2018).
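
The pooling formulas in this section can be collected into one small function. This is a minimal sketch using only the Python standard library; the function name `pool_rubin`, the returned dictionary keys, and the example numbers are hypothetical. The degrees of freedom use Rubin's (1987) form \((m-1)\left(1+\frac{m}{m+1}\frac{\bar{U}}{B}\right)^2\), and the FMI is computed as the simple ratio \(B/T\) described above; software typically reports a refined version with finite-\(m\) adjustments.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates Q_i and their squared standard errors U_i."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)   # pooled point estimate
    u_bar = statistics.mean(variances)   # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance (divides by m - 1)
    t = u_bar + (m + 1) / m * b          # total variance of q_bar
    # If the estimates are identical, B = 0 and the df are unbounded.
    df = (m - 1) * (1 + m / (m + 1) * u_bar / b) ** 2 if b > 0 else math.inf
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t,
            "se": math.sqrt(t), "df": df, "approx_fmi": b / t}

# Hypothetical results from m = 5 analyses (e.g., one regression coefficient).
pooled = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
print(pooled)
```

For these numbers, \(\bar{Q} = 1.0\), \(\bar{U} = 0.04\), \(B = 0.025\), and \(T = 0.04 + 1.2 \times 0.025 = 0.07\), giving about 21.8 degrees of freedom and \(B/T \approx 0.36\). An interval estimate would then be \(\bar{Q} \pm t_{df}\sqrt{T}\), with the \(t\) quantile taken from a statistics library.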