18.9 Example: Prescribed amount of missing data

We will demonstrate using Fisher’s Iris data (built into R), where we can artificially set a prespecified percentage of the data to missing. Because the true values are known, this lets us estimate the bias incurred by using these imputation methods.

For the iris data we set a seed and use the prodNA() function from the missForest package to create 10% missing values in this data set.
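A minimal sketch of this step follows. The seed value and the object name `iris.mis` are assumptions made for illustration; the text does not state which seed was used, so the per-variable counts will differ from those shown in this section.

```r
library(missForest)  # provides prodNA()

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)   # set 10% of all cells to NA at random

# Overall proportion of cells that are now missing
mean(is.na(iris.mis))
```

Note that `prodNA()` removes 10% of the cells in the whole table, so individual variables can end up with more or less than 10% missing.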

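Visualize the missing data pattern. Note that the Count column in the output below reports the proportion of values missing in each variable, not a raw count.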

##  Variables sorted by number of missings: 
##      Variable      Count
##   Petal.Width 0.14000000
##  Sepal.Length 0.13333333
##       Species 0.11333333
##  Petal.Length 0.06000000
##   Sepal.Width 0.05333333
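Per-variable proportions like those listed above can also be computed directly with base R. This is a sketch only: the seed and the name `iris.mis` are assumptions, so the numbers it produces will not match the listing exactly.

```r
library(missForest)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)

# Proportion of missing values in each variable, largest first
prop_missing <- sort(colMeans(is.na(iris.mis)), decreasing = TRUE)
round(prop_missing, 3)

# The same information as raw counts
colSums(is.na(iris.mis))
```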

This is another example where only 10% of the data values are missing overall, yet only 58% of the cases are complete.
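The gap between 10% missing values and 58% complete cases follows from simple arithmetic: a row is complete only if all five of its cells are observed. The sketch below works through that calculation and then checks it on a simulated copy of the data. The seed and the object names are assumptions made for illustration, and the missingness is generated with base R rather than `prodNA()`.

```r
# Chance that a row with 5 variables has no missing cell when each cell
# is missing independently with probability 0.10
p_missing <- 0.10
n_vars <- ncol(iris)                         # iris has 5 columns
expected_complete <- (1 - p_missing)^n_vars
expected_complete                            # 0.59049

# Empirical check on a copy of iris with 75 of its 750 cells set to NA
set.seed(2024)                               # arbitrary seed
mask <- matrix(FALSE, nrow(iris), ncol(iris))
mask[sample(length(mask), 75)] <- TRUE       # pick 75 cells at random
iris_na <- iris
iris_na[mask] <- NA
mean(complete.cases(iris_na))                # share of fully observed rows
```

The simulated share will vary with the seed but should land near the expected value of about 59%.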

18.9.2 Check the imputation method used on each variable.

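Predictive mean matching (PMM) was used for all variables, even the categorical variable Species. This is reasonable because PMM is a hot deck method of imputation: each missing value is filled in with an observed value borrowed from a donor case, so every imputed value is one that actually occurs in the data.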
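The imputation call and the method check might look like the sketch below. The object name `imp_iris`, both seeds, and the `maxit` value are assumptions; only the use of PMM and the 10 imputations come from the text.

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)

# Create m = 10 imputed data sets, using PMM for every incomplete variable
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)

# Named character vector giving the method applied to each variable
imp_iris$method
```

Without `method = "pmm"`, `mice()` would pick a default per variable type, which for a factor with more than two levels such as Species is not PMM.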

18.9.3 Check Convergence

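Convergence is judged by comparing the imputation chains to one another. Here the variance across chains is no larger than the variance within chains, which is what we expect when the algorithm has converged.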
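One common way to make this comparison is with trace plots of the chain means and standard deviations, which `plot()` draws for a `mids` object. The sketch below assumes the same hypothetical setup as elsewhere in this example (arbitrary seeds, object name `imp_iris`, `maxit = 10`).

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)

# One line per imputation chain; healthy chains intermingle freely
# and show no upward or downward trend across iterations
plot(imp_iris)
```

If the lines separate into distinct bands or drift steadily, increasing `maxit` is the usual first remedy.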

18.9.5 Create a complete data set by filling in the missing data using the imputations

Setting action=1 returns the first completed data set, action=2 returns the second completed data set, and so on.

Alternatively, stack the imputed data sets in long format.

By looking at the names of this new object we can confirm that there are indeed 10 complete data sets with \(n=150\) in each.
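Both ways of extracting completed data can be sketched as follows. The object names (`imp_iris`, `iris_1`, `imp_long`) and seeds are assumptions; `complete()` is called with its package prefix to avoid clashes with functions of the same name in other packages.

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)

# First completed data set: the observed values plus imputation number 1
iris_1 <- mice::complete(imp_iris, action = 1)

# All completed data sets stacked on top of one another
imp_long <- mice::complete(imp_iris, action = "long")

names(imp_long)        # includes the .imp and .id bookkeeping columns
table(imp_long$.imp)   # number of rows in each imputed data set
```

The `.imp` column identifies the imputation number and `.id` identifies the original row.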

18.9.6 Visualize Imputations

Let’s compare the imputed values to the observed values to see if they are indeed “plausible”. We want to see that the distribution of the magenta points (imputed) matches the distribution of the blue ones (observed).
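One possible way to draw such a comparison is the `stripplot()` method that mice provides for imputation objects. This is a sketch under the same assumed setup as the rest of the example, and it may not be the plot the original figure showed.

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)

# One strip per imputation number; strip 0 holds the observed data only.
# Observed and imputed points are drawn in different colors.
sp <- stripplot(imp_iris, Sepal.Length ~ .imp, pch = 20, cex = 1.2)
sp
```

Because PMM borrows observed values, the imputed points should fall inside the range of the observed ones.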



Analyze and pool

All of this imputation was done so we could actually perform an analysis!

Let’s run a simple linear regression on Sepal.Length as a function of Sepal.Width, Petal.Length and Species.
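Fitting the model to every imputed data set can be sketched with `with()`, which applies the same `lm()` call to each completed data set in turn. The object names `imp_iris` and `model` and the seeds are assumptions made for illustration.

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)

# Fit the same regression separately on each of the 10 completed data sets
model <- with(imp_iris, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species))

# One column of coefficients per imputed data set
coef_by_imp <- sapply(model$analyses, coef)
round(coef_by_imp, 3)
```

The spread of each coefficient across the columns is the between-imputation variability that pooling accounts for.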

Pooled parameter estimates \(\bar{Q}\) and their standard errors \(\sqrt{T}\) are provided, along with a significance test (against \(\beta_p=0\)). Note that the default summary does not print a 95% interval; it can be requested with the conf.int = TRUE argument to summary(), or calculated manually.
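A sketch of the pooling step and a 95% interval follows. The object names and seeds are assumptions. The manual interval uses the pooled estimate, its standard error, and the degrees of freedom reported in the summary.

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)
model <- with(imp_iris, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species))

# Combine the 10 sets of estimates into one
pooled_fit <- pool(model)
est <- summary(pooled_fit)             # estimate, std.error, statistic, df, p.value

# Manual 95% interval: estimate +/- t quantile * pooled standard error
crit <- qt(0.975, df = est$df)
ci <- cbind(est,
            lower = est$estimate - crit * est$std.error,
            upper = est$estimate + crit * est$std.error)
ci

# Recent versions of mice can also report the interval directly
summary(pooled_fit, conf.int = TRUE)
```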

Digging deeper into the object created by pool(model), specifically the pooled list, we can pull out additional information including the number of missing values, the fraction of missing information (fmi) as defined by Rubin (1987), and lambda, the proportion of total variance that is attributable to the missing data (\(\lambda = (B + B/m)/T\)).

                    estimate   ubar      b      t  lambda    fmi
(Intercept)            2.355  0.073  0.016  0.090   0.191  0.211
Sepal.Width            0.430  0.007  0.001  0.009   0.173  0.192
Petal.Length           0.804  0.004  0.000  0.005   0.081  0.096
Speciesversicolor     -1.039  0.051  0.006  0.057   0.111  0.127
Speciesvirginica      -1.557  0.088  0.006  0.095   0.071  0.085
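The lambda column in the table above can be reproduced from the other columns using the formula \(\lambda = (B + B/m)/T\), where the total variance is \(T = \bar{U} + B + B/m\). The sketch below recomputes both quantities from the pooled components. Object names and seeds are assumptions, so the values will differ from the table.

```r
library(missForest)
library(mice)

set.seed(12345)                        # arbitrary seed, assumed for this sketch
iris.mis <- prodNA(iris, noNA = 0.1)
imp_iris <- mice(iris.mis, m = 10, maxit = 10, method = "pmm",
                 seed = 500, printFlag = FALSE)
model <- with(imp_iris, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species))

pooled <- pool(model)$pooled           # one row per model term
n_imp <- imp_iris$m                    # number of imputations, here 10

# ubar = within-imputation variance, b = between-imputation variance
total_var <- pooled$ubar + pooled$b + pooled$b / n_imp
lambda_manual <- (pooled$b + pooled$b / n_imp) / total_var

# Side-by-side comparison with the values mice reports
round(cbind(t = pooled$t, total_var,
            lambda = pooled$lambda, lambda_manual), 3)
```

For the intercept row of the table, the rounded inputs give \(T \approx 0.073 + 0.016 + 0.0016 = 0.0906\), which agrees with the reported 0.090 up to rounding.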