18.3 Missing Data Mechanisms
The missing data mechanism is the process by which some units are observed and others are not.
- Missing Completely at Random (MCAR): The probability that a data point is missing is completely unrelated to (independent of) any observed or unobserved data or parameters.
- P(Y missing | X, Y) = P(Y missing)
- Ex: Miscoding, or forgetting to log an answer
- Missing at Random (MAR): The probability that a data point is missing can be explained or modeled by other observed variables; given those variables, it does not depend on the unobserved value itself.
- P(Y missing | X, Y) = P(Y missing | X)
- Ex: Y = income, X = sex
- Pr (Y miss | X = male) = 0.2
- Pr (Y miss | X = female) = 0.3
- Here females are less likely than males to answer the income question, so the missing data on income is related to sex. After accounting for sex, the missingness is unrelated to income.
- Not Missing at Random (NMAR): The probability that a data point is missing depends on the value of the variable in question, even after accounting for the observed variables.
- P(Y missing | X, Y) = P(Y missing | X, Y): the dependence on Y cannot be dropped
- Ex: Y = income, X = immigration status
- Richer person may be less willing to disclose income
- Undocumented immigrant may be less willing to disclose income
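The three mechanisms above can be sketched in code. What follows is a minimal illustration, not a definitive implementation: the sample size, the seed, the relationship between `x` and `y`, and the logistic form used for NMAR are all assumptions made up for this sketch. Only the MAR probabilities (0.2 and 0.3) come from the example above.

```r
set.seed(1)                          # hypothetical seed, for reproducibility only
n <- 10000                           # hypothetical sample size
x <- rbinom(n, 1, 0.5)               # observed covariate (assumed coding: 1 = female, 0 = male)
y <- rnorm(n, mean = 50 + 5 * x, sd = 10)  # variable subject to missingness (assumed model)

# MCAR: every value has the same chance of being missing
miss.mcar <- rbinom(n, 1, 0.25) == 1

# MAR: the chance of being missing depends only on the observed x
miss.mar <- rbinom(n, 1, ifelse(x == 1, 0.3, 0.2)) == 1

# NMAR: the chance of being missing depends on y itself (larger y, more likely missing)
miss.nmar <- rbinom(n, 1, plogis((y - 50) / 10)) == 1

# Complete case means under each mechanism, next to the full-data mean
cc.means <- c(full = mean(y),
              mcar = mean(y[!miss.mcar]),
              mar  = mean(y[!miss.mar]),
              nmar = mean(y[!miss.nmar]))
round(cc.means, 2)
```

Under this setup the NMAR complete case mean falls well below the full-data mean, while the MCAR mean stays close to it. Under MAR the overall complete case mean can still drift slightly because `x` is related to `y`; it is the means within each level of `x` that stay on target.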
Write down an example of each.
Does it matter to inferences? Yes!
18.3.1 Demonstration via Simulation
What follows is just one method of approaching this problem via code. Simulation is a frequently used technique to understand the behavior of a process over time or over repeated samples.
18.3.1.1 MCAR
- Draw a random sample of size 100 from a standard Normal distribution (Z) and calculate the mean.
set.seed(456) # setting a seed ensures the same numbers will be drawn each time
z <- rnorm(100)
mean.z <- mean(z)
mean.z
## [1] 0.1205748
- Delete data at a rate of \(p\) and calculate the complete case (available) mean.
- Sample 100 random Bernoulli (0/1) variables with probability \(p\).
x <- rbinom(100, 1, p=.5)
- Find out which elements are 1’s
delete.these <- which(x==1)
- Set those elements in z to NA.
z[delete.these] <- NA
- Calculate the complete case mean
mean(z, na.rm=TRUE)
## [1] 0.1377305
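- Calculate the bias as the complete case mean minus the mean of the full sample. This is the analogue of \(E(\hat\theta) - \theta\), with the full-sample mean standing in for the true value.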
mean(z, na.rm=TRUE) - mean.z
## [1] 0.01715565
How does the bias change as a function of the proportion of missing? Let \(p\) range from 0% to 99% and plot the bias as a function of \(p\).
calc.bias <- function(p){ # create a function to handle the repeated calculations
  mean(ifelse(rbinom(100, 1, p)==1, NA, z), na.rm=TRUE) - mean.z
}
p <- seq(0,.99,by=.01)
plot(c(0,1), c(-1, 1), type="n", ylab="Bias", xlab="Proportion of missing")
points(p, sapply(p, calc.bias), pch=16)
abline(h=0, lty=2, col="blue")
What is the behavior of the bias as \(p\) increases? Look specifically at the position/location of the bias, and the variance/variability of the bias.
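One way to look at location and variability separately is to repeat the deletion many times at a few fixed values of \(p\) and summarize the resulting biases. The sketch below is only one possible approach. The names `p.grid`, `n.rep`, `bias.reps` and `bias.summary`, and the values chosen for them, are assumptions introduced here; `z`, `mean.z` and `calc.bias` are regenerated so the block runs on its own.

```r
set.seed(456)
z <- rnorm(100)
mean.z <- mean(z)

# Bias of the complete case mean after deleting each value with probability p
calc.bias <- function(p) {
  mean(ifelse(rbinom(100, 1, p) == 1, NA, z), na.rm = TRUE) - mean.z
}

p.grid <- c(0.1, 0.5, 0.9)   # hypothetical proportions of missing data
n.rep  <- 2000               # hypothetical number of replications per value of p

# One column of replicated biases per value of p
bias.reps <- sapply(p.grid, function(p) replicate(n.rep, calc.bias(p)))

# Location (mean) and variability (sd) of the bias at each p;
# na.rm = TRUE guards against the rare replicate where every value is deleted
bias.summary <- data.frame(p         = p.grid,
                           mean.bias = colMeans(bias.reps, na.rm = TRUE),
                           sd.bias   = apply(bias.reps, 2, sd, na.rm = TRUE))
bias.summary
```

Comparing the `mean.bias` and `sd.bias` columns across rows separates the two parts of the question: whether the bias is centered away from zero, and how much it varies from one deletion to the next.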
18.3.1.3 NMAR: Pure Censoring
Consider a hypothetical blood test that measures a hormone that is normally distributed in the blood with mean 10 \(\mu g\) and variance 1. However, the test can only detect levels above 10, so any value below 10 is recorded as missing.
z <- rnorm(100, 10, 1)
y <- z
y[y<10] <- NA
mean(z) - mean(y, na.rm=TRUE)
## [1] -0.6850601
Did the complete case estimate over- or under-estimate the true mean?
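The simulated difference can be compared against theory. For a normal variable with mean \(\mu\) and standard deviation \(\sigma\), the mean of the values above a cutoff \(a\) is \(\mu + \sigma\,\phi(\alpha)/(1-\Phi(\alpha))\) with \(\alpha = (a-\mu)/\sigma\). The sketch below evaluates that formula for the censoring example and checks it against a larger simulation. The names `cutoff`, `trunc.mean`, `expected.diff`, `z.big` and `sim.diff`, the seed, and the sample size of 100,000 are assumptions introduced for this illustration.

```r
mu <- 10; sigma <- 1; cutoff <- 10   # values taken from the censoring example
alpha <- (cutoff - mu) / sigma

# Mean of a normal distribution truncated below at the cutoff
trunc.mean <- mu + sigma * dnorm(alpha) / (1 - pnorm(alpha))

# Value that mean(z) - mean(y, na.rm=TRUE) should approach in large samples
expected.diff <- mu - trunc.mean
expected.diff

# Check against a much larger simulated sample (hypothetical seed and size)
set.seed(2)
z.big <- rnorm(1e5, mu, sigma)
sim.diff <- mean(z.big) - mean(z.big[z.big >= cutoff])
sim.diff
```

With the cutoff at the mean, the formula gives an expected difference of about -0.80, so the value from a single sample of 100 is in the range that sampling variability would allow.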
Degrees of difficulty
- MCAR: is easiest to deal with.
- MAR: we can live with it.
- NMAR: most difficult to handle.
Evidence?
What can we learn from evidence in the data set at hand?
- There may be evidence in the data to rule out MCAR: compare responders vs. nonresponders.
- Example: Responders tend to have higher/lower average education than nonresponders by t-test
- Example: Response more likely in one geographic area than another by chi-square test
- The data set at hand contains no evidence that could rule out MAR (although there may be evidence from an external data source)
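The two checks in the examples above can be sketched on simulated data. Everything in this block is hypothetical: the variables `educ`, `region` and `responded`, the sample size, and the response model are invented for illustration and do not come from any real survey.

```r
set.seed(3)                                               # hypothetical seed
n <- 2000                                                 # hypothetical sample size
educ   <- round(rnorm(n, mean = 13, sd = 2.5))            # hypothetical years of education
region <- sample(c("North", "South"), n, replace = TRUE)  # hypothetical geographic area

# Assumed response mechanism: response is more likely with more education and in the North
p.respond <- plogis(-0.5 + 0.25 * (educ - 13) + 0.8 * (region == "North"))
responded <- factor(ifelse(rbinom(n, 1, p.respond) == 1, "responder", "nonresponder"))

# t-test: do responders and nonresponders differ in mean education?
tt <- t.test(educ ~ responded)

# Chi-square test: is response associated with geographic area?
ct <- chisq.test(table(region, responded))

tt$p.value
ct$p.value
```

Small p-values from either test are evidence against MCAR, since they show that response is related to an observed variable. A large p-value does not establish MCAR; it only means this particular check found nothing.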
What is plausible?
- Cochran example: when human behavior is involved, MCAR must be viewed as an extremely special case that would often be violated in practice
- Missing data may be introduced by design (e.g., measure some variables, don’t measure others for reasons of cost, response burden), in which case MCAR would apply
- MAR is much more common than MCAR
- We cannot be too cavalier about assuming MAR, but anecdotal evidence shows that it often is plausible when conditioning on enough information
Ignorable Missing
- If the missing-data mechanism is MCAR or MAR, then nonresponse is said to be “ignorable”.
- Origin of name: in likelihood-based inference, both the data model and the missing-data mechanism are important, but with MCAR or MAR, inference can be based solely on the data model, which makes inference much simpler
- “Ignorability” is a relative assumption: missingness on income may be NMAR given only gender, but may be MAR given gender, age, occupation, region of the country
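The origin of the name can be made concrete with a short derivation. The notation here is an assumption introduced for this sketch and is not used elsewhere in these notes: \(R\) is the vector of missingness indicators, \(Y = (Y_{obs}, Y_{mis})\) splits the data into observed and missing parts, \(\theta\) holds the parameters of the data model, and \(\psi\) holds the parameters of the missing-data mechanism.

```latex
% Likelihood of what is actually observed: (Y_obs, R)
f(Y_{obs}, R \mid \theta, \psi)
  = \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(R \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}

% Under MAR the mechanism does not depend on the missing values:
%   f(R | Y_obs, Y_mis, psi) = f(R | Y_obs, psi)
% so that factor moves outside the integral
f(Y_{obs}, R \mid \theta, \psi)
  = f(R \mid Y_{obs}, \psi) \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}
  = f(R \mid Y_{obs}, \psi)\, f(Y_{obs} \mid \theta)
```

The first factor does not involve \(\theta\), so, provided \(\theta\) and \(\psi\) are distinct parameters, likelihood-based inference about \(\theta\) can use \(f(Y_{obs} \mid \theta)\) alone. MCAR is the special case in which the first factor does not depend on \(Y_{obs}\) either. Under NMAR the factor stays inside the integral and the mechanism has to be modeled.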