18.9 Example: Prescribed amount of missing data.
We will demonstrate using Fisher's iris data (built into R), where we can artificially make a prespecified percentage of the data missing. This allows us to estimate the bias incurred by using these imputation methods.
For the iris data we set a seed and use the prodNA() function from the missForest package to create 10% missing values in this data set.
library(missForest)
set.seed(12345) # Note to self: Change the combo on my luggage
iris.mis <- prodNA(iris, noNA=0.1)
prop.table(table(is.na(iris.mis)))
##
## FALSE TRUE
## 0.9 0.1
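The same check can be done per variable in base R. A quick sketch applied to the iris.mis object created above (the miss_summary helper name is ours, not from any package):

```r
# Count and proportion of missing values in each column of a data frame
miss_summary <- function(df) {
  n_miss <- colSums(is.na(df))
  data.frame(variable     = names(df),
             n_missing    = n_miss,
             prop_missing = n_miss / nrow(df),
             row.names    = NULL)
}

miss_summary(iris.mis)
```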
Visualize missing data pattern.
library(VIM)
aggr(iris.mis, col=c('darkolivegreen3','salmon'),
numbers=TRUE, sortVars=TRUE,
labels=names(iris.mis), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Petal.Width 0.14000000
## Sepal.Length 0.13333333
## Species 0.11333333
## Petal.Length 0.06000000
## Sepal.Width 0.05333333
Here's another example where only 10% of the data overall is missing, yet it leaves only 58% of the rows as complete cases.
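That complete-case figure is easy to verify directly; a minimal sketch, assuming iris.mis as created above:

```r
# A row is a "complete case" only if all 5 of its variables are observed,
# so 10% of cells missing at random knocks out far more than 10% of rows.
mean(complete.cases(iris.mis))   # proportion of fully observed rows (about 0.58 here)
```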
18.9.1 Multiply impute the missing data using mice()
imp_iris <- mice(iris.mis, m=10, maxit=25, meth="pmm", seed=500, printFlag=FALSE)
summary(imp_iris)
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "pmm" "pmm" "pmm" "pmm" "pmm"
## PredictorMatrix:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Sepal.Length 0 1 1 1 1
## Sepal.Width 1 0 1 1 1
## Petal.Length 1 1 0 1 1
## Petal.Width 1 1 1 0 1
## Species 1 1 1 1 0
The Stack Exchange post linked below has a great explanation of what each of these arguments controls. It is a very good idea to understand these controls.
https://stats.stackexchange.com/questions/219013/how-do-the-number-of-imputations-the-maximum-iterations-affect-accuracy-in-mul/219049#219049
18.9.2 Check the imputation method used on each variable.
imp_iris$meth
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "pmm" "pmm" "pmm" "pmm" "pmm"
Predictive mean matching was used for all variables, even Species. This is reasonable because PMM is a hot deck method of imputation: imputed values are always drawn from observed values, so a factor is imputed with valid factor levels.
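To build intuition for why PMM keeps imputations plausible, here is a stripped-down sketch of the core idea on a toy variable (mice's actual pmm also draws the regression coefficients from their posterior, which this simplified version skips):

```r
# Toy predictive mean matching on one numeric variable:
# 1. fit a regression on the complete cases,
# 2. compute predicted means for all rows,
# 3. impute each missing y with an OBSERVED y drawn from the 5 donors
#    whose predicted means are closest -- imputations are always real values.
set.seed(42)
x <- rnorm(30); y <- 2 * x + rnorm(30)
y[sample(30, 5)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

for (i in which(!obs)) {
  donors <- order(abs(pred[obs] - pred[i]))[1:5]  # 5 closest observed cases
  y[i]   <- y[obs][sample(donors, 1)]             # hot deck draw from donors
}
```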
18.9.3 Check Convergence
plot(imp_iris, c("Sepal.Length", "Sepal.Width", "Petal.Length"))
The variance across chains is no larger than the variance within chains.
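If the trace plots had not mixed well, the chains can be continued for more iterations without starting over. A sketch using mice's mice.mids() continuation function on the imp_iris object from above:

```r
# Run the existing chains for 10 additional iterations; convergence can
# then be re-checked with plot() before proceeding.
imp_iris_more <- mice.mids(imp_iris, maxit = 10, printFlag = FALSE)
```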
18.9.4 Look at the values generated for imputation
imp_iris$imp$Sepal.Length
## 1 2 3 4 5 6 7 8 9 10
## 12 5.4 5.2 5.3 5.0 5.4 4.9 5.1 4.9 5.4 5.4
## 13 4.7 4.4 5.0 4.4 4.6 4.4 4.9 4.9 5.0 4.9
## 14 4.4 4.4 4.7 4.5 4.4 4.9 4.4 4.4 4.4 4.4
## 36 4.4 4.4 4.7 4.9 4.7 4.4 4.4 4.4 4.9 4.7
## 38 5.1 5.2 5.4 5.0 5.1 4.6 4.6 5.0 4.7 5.0
## 40 5.2 5.0 5.4 5.4 4.6 5.0 4.8 5.0 4.6 5.1
## 46 4.9 4.4 4.4 4.9 4.7 4.4 4.7 4.7 5.0 5.5
## 51 6.9 6.4 6.8 6.8 6.7 6.3 6.5 6.4 6.8 6.9
## 56 6.0 5.8 5.4 5.8 6.1 5.4 5.4 5.9 6.0 6.0
## 62 6.3 5.8 6.3 5.8 5.7 5.7 6.8 5.8 5.8 6.2
## 74 6.7 5.4 6.4 6.1 6.9 6.8 6.5 6.1 6.4 6.7
## 75 6.2 6.2 5.6 5.8 6.2 6.0 6.7 6.7 5.8 5.7
## 86 6.4 6.0 6.5 6.9 6.1 6.3 6.9 6.8 6.3 6.5
## 90 5.2 5.6 5.5 5.2 5.2 6.0 6.2 5.2 4.9 5.8
## 91 5.6 6.2 6.2 5.6 6.1 5.8 6.0 6.2 5.8 6.2
## 106 7.7 7.9 7.7 7.3 7.9 7.7 7.3 7.7 7.7 7.7
## 124 6.7 6.2 5.7 5.7 6.2 5.7 5.6 5.7 5.6 5.7
## 142 6.2 6.1 6.0 6.1 6.9 6.6 6.1 6.5 6.6 5.8
## 145 6.3 6.4 6.4 6.7 6.3 6.3 6.4 6.3 6.7 6.5
## 148 6.4 6.5 5.6 6.4 6.7 6.0 6.4 6.5 6.4 5.6
This is just for us to see what the imputed data look like. Each column is one imputation; each row is a row where an imputation for Sepal.Length was needed. Notice that only imputations are shown; no observed data appear here.
18.9.5 Create a complete data set by filling in the missing data using the imputations
iris_1 <- complete(imp_iris, action=1)
Action=1 returns the first completed data set, action=2 returns the second completed data set, and so on.
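complete() can also hand back all m completed data sets at once as a list; a small sketch, assuming imp_iris from above:

```r
# action='all' returns a list with one completed data frame per imputation
iris_all <- complete(imp_iris, action = 'all')
length(iris_all)   # 10, one element per imputation
```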
18.9.5.1 Alternative - Stack the imputed data sets in long format.
iris_long <- complete(imp_iris, 'long')
By looking at the names
of this new object we can confirm that there are indeed 10 complete data sets with \(n=150\) in each.
names(iris_long)
## [1] ".imp" ".id" "Sepal.Length" "Sepal.Width" "Petal.Length"
## [6] "Petal.Width" "Species"
table(iris_long$.imp)
##
## 1 2 3 4 5 6 7 8 9 10
## 150 150 150 150 150 150 150 150 150 150
18.9.6 Visualize Imputations
Let’s compare the imputed values to the observed values to see if they are indeed “plausible”. We want to see that the distribution of the magenta points (imputed) matches the distribution of the blue ones (observed).
Univariately
densityplot(imp_iris)
Multivariately
xyplot(imp_iris, Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width | Species, cex=.8, pch=16)
Analyze and pool
All of this imputation was done so we could actually perform an analysis!
Let’s run a simple linear regression on Sepal.Length as a function of Sepal.Width, Petal.Length, and Species.
model <- with(imp_iris, lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species))
summary(pool(model))
## term estimate std.error statistic df p.value
## 1 (Intercept) 2.3291755 0.29018720 8.026458 100.48280 1.942002e-12
## 2 Sepal.Width 0.4324051 0.09212873 4.693488 88.97127 9.678996e-06
## 3 Petal.Length 0.8165354 0.07163496 11.398560 107.17114 0.000000e+00
## 4 Speciesversicolor -1.0848839 0.24416495 -4.443242 96.69963 2.360206e-05
## 5 Speciesvirginica -1.6001933 0.31893115 -5.017363 103.29187 2.189018e-06
Pooled parameter estimates \(\bar{Q}\) and their standard errors \(\sqrt{T}\) are provided, along with a significance test (against \(\beta_p=0\)). Note with this output that a 95% interval must be calculated manually.
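One way to compute those intervals, assuming the pooled model from above (summary.mipo also accepts conf.int = TRUE to do this directly):

```r
# Manual 95% CI using the pooled estimates, standard errors, and the
# Barnard-Rubin adjusted degrees of freedom reported by summary(pool(model))
ps <- summary(pool(model))
data.frame(term  = ps$term,
           lower = ps$estimate - qt(0.975, ps$df) * ps$std.error,
           upper = ps$estimate + qt(0.975, ps$df) * ps$std.error)
```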
We can leverage the gtsummary package to tidy and print the results of a mids object, but the mice object has to be passed to tbl_regression BEFORE you pool (ref: SO post). This function needs to access features of the original model first; then it will do the appropriate pooling and tidying.
gtsummary::tbl_regression(model)
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| Sepal.Width | 0.43 | 0.25, 0.62 | <0.001 |
| Petal.Length | 0.82 | 0.67, 1.0 | <0.001 |
| Species | | | |
| setosa | — | — | |
| versicolor | -1.1 | -1.6, -0.60 | <0.001 |
| virginica | -1.6 | -2.2, -1.0 | <0.001 |

¹ CI = Confidence Interval
Additionally, digging deeper into the object created by pool(model), specifically the pooled list, we can pull out additional information including the number of missing values, the fraction of missing information (fmi) as defined by Rubin (1987), and lambda, the proportion of total variance that is attributable to the missing data (\(\lambda = (B + B/m)/T\)).
kable(pool(model)$pooled[,c(1:4, 8:9)], digits=3)

| term | m | estimate | ubar | df | riv |
|---|---|---|---|---|---|
| (Intercept) | 10 | 2.329 | 0.073 | 100.483 | 0.151 |
| Sepal.Width | 10 | 0.432 | 0.007 | 88.971 | 0.193 |
| Petal.Length | 10 | 0.817 | 0.005 | 107.171 | 0.129 |
| Speciesversicolor | 10 | -1.085 | 0.051 | 96.700 | 0.164 |
| Speciesvirginica | 10 | -1.600 | 0.089 | 103.292 | 0.141 |
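The lambda formula above can be reconstructed from the stored pooled components; a sketch, assuming the pooled model from above (the pooled data frame also carries lambda and fmi columns directly, so this is just a consistency check):

```r
# ubar = within-imputation variance, b = between-imputation variance,
# m = number of imputations
pl     <- pool(model)$pooled
Tvar   <- pl$ubar + (1 + 1/pl$m) * pl$b   # total variance T = ubar + (1 + 1/m)B
lambda <- (pl$b + pl$b/pl$m) / Tvar       # lambda = (B + B/m)/T
data.frame(term = pl$term, lambda = round(lambda, 3))
```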
18.9.7 Calculating bias
The iris data set had no missing data to begin with. So we can calculate the “true” parameter estimates…
true.model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data=iris)
and find the difference in coefficients.
The variance of the multiply imputed estimates is larger because of the between-imputation variance.
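Before plotting, the bias itself can be checked numerically; a one-line sketch assuming true.model and the pooled model from above:

```r
# Pooled MI estimates minus the complete-data ("true") estimates
round(summary(pool(model))$estimate - coef(true.model), 3)
```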
library(forestplot)
te.mean <- summary(true.model)$coefficients[,1]
mi.mean <- summary(pool(model))[,2]
te.ll <- te.mean - 1.96*summary(true.model)$coefficients[,2]
mi.ll <- mi.mean - 1.96*summary(pool(model))[,3]
te.ul <- te.mean + 1.96*summary(true.model)$coefficients[,2]
mi.ul <- mi.mean + 1.96*summary(pool(model))[,3]
names <- names(coef(true.model))
forestplot(names,
legend = c("True Model", "MICE"),
fn.ci_norm = c(fpDrawNormalCI, fpDrawCircleCI),
mean = cbind(te.mean, mi.mean),
lower = cbind(te.ll, mi.ll),
upper = cbind(te.ul, mi.ul),
col=fpColors(box=c("blue", "darkred")),
xlab="Regression coefficients",
boxsize = .1
)