Multiple Imputation (MI) for estimating a desired statistic with missing data
This follows ^Shafer (page 4) and ^Austin et al. (section "Analyses in the M imputed data sets"), both of which give a primer on Rubin's book on MI.
Let
- $Y$ be our data, and break it into $Y_{obs}$ (observed/existing), $Y_{mis}$ (missing subset).
- $Q(Y)$ be the variable of interest, to be computed from our data.
- For all of the following to hold, $Q(Y)$ must be a normal random variable.
Method:
- Find a way to simulate $Y_{mis}$, and generate $\{Y_{mis}^{(i)}\}_{i = 1}^m$, $m$ samples.
- Compute $Q^i := Q(Y_{obs}, Y_{mis}^{(i)})$
- Compute $\bar{Q} := \sum_i Q^i / m$. This is our estimate!
We can go further and get t statistic confidence intervals/hypothesis tests on how well $\bar{Q}$ approximates $Q$.
Required is the variance of $Q$:
- Recall Total Variance formula: $var(Q) = E_Y( var(Q|Y) ) + var_Y( E(Q|Y))$.
- We need to know the formula for $U := var(Q|Y)$.
- Compute $U^{i}:= var(Q^i|Y^i)$
- Estimate $E_Y( var(Q|Y) )$ by computing $\bar{U}:= \sum U^i/m$.
- This is the first term of the total variance formula
- This is called average within-imputation variance
- $\bar{Q}:= \sum(Q^i)/m $, our estimate of $E(Q|Y)$ is already computed.
- Compute $B:= (m-1)^{-1}\sum (Q^i-\bar{Q})^2$ to estimate $var( \bar{Q})$
- This is called between-imputation variance
- Finally, the desired estimate of the total variance is $T := Var(Q) = (1+(1/m))B + \bar{U}$
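To make the pooling steps concrete, here is a minimal Python sketch of the recipe above, assuming the per-imputation estimates $Q^i$ and their variances $U^i$ have already been computed. The function name `pool_rubin` and the input numbers are made up for illustration; they do not come from the cited papers.

```python
from statistics import mean, variance


def pool_rubin(q_hats, u_hats):
    """Pool m per-imputation estimates Q^i and their variances U^i."""
    m = len(q_hats)
    if m < 2 or len(u_hats) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")
    q_bar = mean(q_hats)      # pooled point estimate, Q-bar
    u_bar = mean(u_hats)      # average within-imputation variance, U-bar
    b = variance(q_hats)      # between-imputation variance B (divisor m - 1)
    t_total = u_bar + (1 + 1 / m) * b  # total variance T
    return q_bar, u_bar, b, t_total


# Hypothetical values for m = 5 imputations (illustrative only).
q_bar, u_bar, b, t_total = pool_rubin(
    [10.0, 11.0, 9.0, 10.5, 9.5],
    [4.0, 4.2, 3.8, 4.1, 3.9],
)
print(q_bar, u_bar, b, t_total)
```

Note that `statistics.variance` already uses the $m-1$ divisor, so it matches the definition of $B$ given above.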
Questions here!
Questions on step 5:
- Isn't $B$ the sample variance for $\{Q^i\}$, which should estimate $var(Q(Y)|Y)$ which is just $U$?
- I think what we really want when we say $var_Y( E(Q|Y))$ is the square of the standard error of the mean, which is $B/m$. This follows ^StandardError:
The standard error [of the mean] is, by definition, the standard deviation of $ \bar{x}$ which is simply the square root of the variance:
$\sigma_{\bar {x}} = \sqrt { \frac{\sigma^{2}}{n} } = \frac{\sigma}{ \sqrt{n}}$
- ^Austin et al. go on to say
When focusing on a single statistic, the Monte Carlo error can be computed as $\sqrt{B/M}$.
So this seems to confirm my logic above. If so, then is between-imputation variance $B$ or $B/M$ or $(1+1/m)B$ ?
Questions on step 6:
- It seems to me that the total variance formula says $var(Q) = B/M + \bar{U}$. Am I right?
- If so, why does the formula in these papers multiply $B$ by $(1+1/m)$ and use $var(Q) = B + B/m + \bar{U}$?
The papers go on to say we can construct Student's t confidence intervals:
- $(Q-\bar{Q})/\sqrt{T} \sim $ Student's t distribution with $\nu = (m-1)[1 + \frac{\bar{U}}{(1+(1/m))B}]^2$ degrees of freedom
- This means $\bar{Q} \pm A\sqrt{T}$ is a $k$ confidence interval if $A$ is the two-sided t-value at $(1-k, \nu)$. Note the square root: $T$ is a variance, so the interval is scaled by $\sqrt{T}$.
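As a rough illustration of the interval, the sketch below computes $\nu$ from the formula above and builds $\bar{Q} \pm A\sqrt{T}$ (the half-width uses $\sqrt{T}$ because $T$ is a variance). The helper names and the input numbers are hypothetical. The critical value $A$ is passed in as a parameter; here it is taken from the standard normal distribution, which is only an approximation that seems reasonable when $\nu$ is large. An exact t quantile would need a t-distribution quantile function, for example `scipy.stats.t.ppf` from the third-party SciPy package.

```python
from math import sqrt
from statistics import NormalDist


def mi_degrees_of_freedom(m, u_bar, b):
    """nu = (m - 1) * (1 + u_bar / ((1 + 1/m) * b))**2; needs b > 0."""
    if m < 2 or b <= 0:
        raise ValueError("need m >= 2 and a positive between-imputation variance")
    return (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2


def mi_interval(q_bar, t_total, crit):
    """Return q_bar -/+ crit * sqrt(T), where T is the total variance."""
    half_width = crit * sqrt(t_total)
    return q_bar - half_width, q_bar + half_width


# Hypothetical pooled values: m = 5, U-bar = 4.0, B = 0.625, T = 4.75.
nu = mi_degrees_of_freedom(5, 4.0, 0.625)
z = NormalDist().inv_cdf(0.975)  # normal stand-in for the t critical value
low, high = mi_interval(10.0, 4.75, z)
print(nu, low, high)
```

With these made-up inputs $\nu$ comes out well above 100, which is the regime where the t and normal critical values are close.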
Question: If $m$ is sufficiently large (unsure, maybe 10K?), couldn't we use a z-score/Normal for our tests, since $Q(Y)$ is normal?
General Question
Can you check my understanding of why we cannot just use $ \sum(Q^i - \bar{Q})^2/(m-1)$ as the estimate of the variance of $Q$? If $Y_{mis}$ doesn't affect $Q$ at all, then we'd get a variance of 0 with this computation. We need $U$ to account for the variance that comes from $Q$ on $Y_{obs}$. Is that right?
Best Answer
The total variance formula you would like to use is based on an infinitely large number of imputations, with $\bar U_{\infty}$ and $B_{\infty}$ the within- and between-imputation variance. The estimate of the within-imputation variance $\bar U$ based on a finite number $m$ of imputations is unbiased, but there's a bias if you substitute the finite-imputation $B$ for $B_{\infty}$. Quoting Section 2.3.2 of van Buuren's Flexible Imputation of Missing Data:
With $B$ providing an approximation to $B_{\infty}$, you get the 3rd contribution to the total variance, $B/m$, that you are questioning.
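One way to write out the decomposition behind that statement, as a sketch in the question's notation ($\bar Q_{\infty}$, the pooled estimate under infinitely many imputations, is notation introduced here):

$$T = \underbrace{\bar U}_{\text{within}} + \underbrace{B}_{\text{between}} + \underbrace{\frac{B}{m}}_{\text{simulation error of } \bar Q} = \bar U + \left(1 + \frac{1}{m}\right) B$$

The third term appears because $\bar Q$ is an average of only $m$ draws whose spread is estimated by $B$, so $var(\bar Q - \bar Q_{\infty}) \approx B/m$. It vanishes as $m \to \infty$, which recovers the two-term total variance formula $\bar U_{\infty} + B_{\infty}$.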