Multiple Imputation – Understanding the Formula for Between-Imputation Variance

multiple-imputation, standard-error, variance

Multiple Imputation (MI) for estimating a desired statistic when some of the data are missing

I am following ^Schafer (page 4) and ^Austin et al. (section "Analyses in the M imputed data sets"), both of which give a primer on Rubin's book on MI.

Let

• $$Y$$ be our data, split into $$Y_{obs}$$ (the observed values) and $$Y_{mis}$$ (the missing values).
• $$Q(Y)$$ be the variable of interest, to be computed from our data.
• Everything below assumes that $$Q(Y)$$ is a normally distributed random variable.

Method:

• Find a way to simulate $$Y_{mis}$$, and generate $$\{Y_{mis}^{(i)}\}_{i = 1}^m$$, $$m$$ samples.
• Compute $$Q^i := Q(Y_{obs}, Y_{mis}^{(i)})$$
• Compute $$\bar{Q} := \sum_i Q^i / m$$. This is our estimate!
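The method above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the data are simulated, $$Q$$ is taken to be the sample mean, and the hypothetical helper `impute_once` draws missing values from a normal distribution fitted to the observed data, which ignores parameter uncertainty and is therefore not a "proper" imputation in Rubin's sense. The sketch also records $$U^i$$, the estimated variance of each $$Q^i$$, since the variance steps need it.

```python
import random
from statistics import mean, variance

def impute_once(y_obs, n_mis, rng):
    """Draw one set of missing values from a normal fitted to the observed data.

    Toy assumption: the fitted mean and sd are treated as known, so this is
    not a 'proper' imputation; it only illustrates the mechanics.
    """
    mu, sd = mean(y_obs), variance(y_obs) ** 0.5
    return [rng.gauss(mu, sd) for _ in range(n_mis)]

def analyse(y):
    """Return (Q, U): the sample mean and its estimated variance s^2 / n."""
    return mean(y), variance(y) / len(y)

rng = random.Random(0)
y_obs = [rng.gauss(50.0, 10.0) for _ in range(80)]  # hypothetical observed values
n_mis, m = 20, 10                                   # 20 missing values, m = 10 imputations

# One completed data set per imputation, analysed with the same procedure.
results = [analyse(y_obs + impute_once(y_obs, n_mis, rng)) for _ in range(m)]
q = [qi for qi, _ in results]  # Q^i for each completed data set
u = [ui for _, ui in results]  # U^i for each completed data set
q_bar = mean(q)                # pooled point estimate
print(q_bar)
```

In a real analysis the imputation model would usually come from a dedicated MI package rather than a hand-written draw like this one.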

We can go further and build t-based confidence intervals and hypothesis tests for how well $$\bar{Q}$$ approximates $$Q$$.

For that we need the variance of $$Q$$:

1. Recall the law of total variance: $$var(Q) = E_Y( var(Q|Y) ) + var_Y( E(Q|Y))$$.
2. We need to know the formula for $$U:= var(Q|Y)$$.
3. Compute $$U^{i}:= var(Q^i|Y^i)$$
4. Estimate $$E_Y( var(Q|Y) )$$ by computing $$\bar{U}:= \sum U^i/m$$.
• This is the first term of the total variance formula
• This is called average within-imputation variance
5. $$\bar{Q}:= \sum(Q^i)/m$$, our estimate of $$E(Q|Y)$$ is already computed.
6. Compute $$B:= (m-1)^{-1}\sum (Q^i-\bar{Q})^2$$ to estimate $$var( \bar{Q})$$
• This is called between-imputation variance
7. Finally, the desired estimate of $$Var(Q)$$ is $$T:= (1+(1/m))B + \bar{U}$$
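The pooling steps above can be sketched in Python as follows. The function name `pool_rubin` and the input numbers are made up for illustration; the formulas are the ones listed in the steps.

```python
from statistics import mean, variance

def pool_rubin(q, u):
    """Pool m point estimates q[i] and their within-imputation variances u[i]."""
    m = len(q)
    if m < 2 or len(u) != m:
        raise ValueError("need m >= 2 estimates and one variance per estimate")
    q_bar = mean(q)               # pooled point estimate
    u_bar = mean(u)               # average within-imputation variance
    b = variance(q, q_bar)        # between-imputation variance, divisor m - 1
    t = u_bar + (1 + 1 / m) * b   # total variance
    return q_bar, u_bar, b, t

# Hypothetical results from m = 5 imputed data sets (not from a real study).
q_hat = [10.0, 11.0, 9.0, 10.5, 9.5]
u_hat = [4.0, 4.2, 3.8, 4.1, 3.9]
q_bar, u_bar, b, t = pool_rubin(q_hat, u_hat)
print(q_bar, u_bar, b, t)
```

With these made-up inputs, $$\bar{Q} = 10$$, $$\bar{U}$$ is about 4, $$B = 0.625$$ and $$T$$ is about 4.75. If every $$Q^i$$ were identical, $$B$$ would be 0 and $$T$$ would reduce to $$\bar{U}$$.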

Questions here!

Questions on step 6 (the between-imputation variance $$B$$):

• Isn't $$B$$ the sample variance for $$\{Q^i\}$$, which should estimate $$var(Q(Y)|Y)$$ which is just $$U$$?
• I think what we really want when we say $$var_Y( E(Q|Y))$$ is the square of the standard error of the mean, which is $$B/m$$. This follows ^StandardError:

The standard error [of the mean] is, by definition, the standard deviation of $$\bar{x}$$ which is simply the square root of the variance:
$$\sigma_{\bar {x}} = \sqrt { \frac{\sigma^{2}}{n} } = \frac{\sigma}{ \sqrt{n}}$$

When focusing on a single statistic, the Monte Carlo error can be computed as $$\sqrt{B/M}$$.

So this seems to confirm my logic above. If so, is the between-imputation variance $$B$$, $$B/m$$, or $$(1+1/m)B$$?

Questions on step 7 (the total variance $$T$$):

• It seems to me that the total variance formula says $$var(Q) = B/m + \bar{U}$$. Am I right?
• If so, why does the formula in these papers multiply $$B$$ by $$(1+1/m)$$ and use $$var(Q) = B + B/m + \bar{U}$$?

The papers go on to say that we can build Student t confidence intervals:

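1. $$(Q-\bar{Q})/\sqrt{T}$$ follows a Student t distribution with $$\nu = (m-1)\left[1 + \frac{\bar{U}}{(1+(1/m))B}\right]^2$$ degrees of freedom.
2. This means $$\bar{Q} \pm A\sqrt{T}$$ is a level-$$k$$ confidence interval, where $$A$$ is the two-sided t critical value for significance level $$1-k$$ with $$\nu$$ degrees of freedom.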
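As a rough sketch, the degrees of freedom can be computed directly from $$m$$, $$\bar{U}$$ and $$B$$. The function names and numbers below are hypothetical. The Python standard library has no Student t quantile, so the interval shown uses a standard normal quantile, which is only a large-$$\nu$$ approximation; for the t interval itself one would swap in a t quantile, for example `scipy.stats.t.ppf(0.5 + level / 2, nu)` if SciPy is available.

```python
from math import sqrt
from statistics import NormalDist

def rubin_df(m, u_bar, b):
    """Degrees of freedom nu = (m - 1) * (1 + u_bar / ((1 + 1/m) * b))**2."""
    if b <= 0:
        return float("inf")  # no between-imputation variance: the t limit is the normal
    return (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2

def normal_interval(q_bar, t, level=0.95):
    """Large-nu approximation: q_bar +/- z * sqrt(T), with z a standard normal quantile."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sqrt(t)  # T is a variance, so the interval scales with sqrt(T)
    return q_bar - half, q_bar + half

# Hypothetical pooled values for m = 5 imputations (illustrative only).
m, u_bar, b = 5, 4.0, 0.625
t = u_bar + (1 + 1 / m) * b
nu = rubin_df(m, u_bar, b)
lo, hi = normal_interval(10.0, t)
print(nu, lo, hi)
```

Note that $$\nu$$ depends on the ratio $$\bar{U}/B$$ as well as on $$m$$: in this made-up example $$\nu$$ is roughly 160 even though $$m = 5$$, because $$B$$ is small relative to $$\bar{U}$$.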

Question: If $$m$$ is sufficiently large (I am unsure how large, maybe 10,000?), couldn't we use a z-score (normal distribution) for our tests, since $$Q(Y)$$ is normal?

General Question
Can you check my understanding of why we cannot just use $$\sum(Q^i - \bar{Q})^2/(m-1)$$ as the estimate of the variance of $$Q$$? If $$Y_{mis}$$ doesn't affect $$Q$$ at all, then this computation would give a variance of 0. We need $$U$$ to account for the variance that comes from computing $$Q$$ on $$Y_{obs}$$. Is that right?

The total variance formula you would like to use is based on an infinitely large number of imputations, with $$\bar U_{\infty}$$ and $$B_{\infty}$$ the within- and between-imputation variance. The estimate of the within-imputation variance $$\bar U$$ based on a finite number $$m$$ of imputations is unbiased, but there's a bias if you substitute the finite-imputation $$B$$ for $$B_{\infty}$$. Quoting Section 2.3.2 of van Buuren's Flexible Imputation of Missing Data:
It is tempting to conclude that the total variance $$T$$ is equal to the sum of $$\bar U$$ and $$B$$, but that would be incorrect. We need to incorporate the fact that $$\bar Q$$ itself is estimated using finite $$m$$, and thus only approximates $$Q_{\infty}$$. Rubin shows that the contribution to the variance of this factor is systematic and equal to $$B_{\infty}/m$$.
With $$B$$ providing an approximation to $$B_{\infty}$$, you get the 3rd contribution to the total variance, $$B/m$$, that you are questioning.
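A sketch of where the three terms come from, writing $$\bar Q_m$$ for the average over $$m$$ imputations and $$\bar Q_{\infty}$$ for its infinite-imputation limit (this notation is an assumption added here, chosen to match the quote), and treating the two error pieces as uncorrelated:

$$Q - \bar Q_m = (Q - \bar Q_{\infty}) + (\bar Q_{\infty} - \bar Q_m)$$

$$T = \underbrace{\bar U_{\infty} + B_{\infty}}_{var(Q - \bar Q_{\infty})} + \underbrace{B_{\infty}/m}_{var(\bar Q_{\infty} - \bar Q_m)} \approx \bar U + B + \frac{B}{m} = \bar U + \left(1 + \frac{1}{m}\right) B$$

So $$B$$ itself measures the extra uncertainty caused by the missing data, while $$B/m$$ is only the Monte Carlo error from using a finite number of imputations; the latter vanishes as $$m \to \infty$$, the former does not.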