Multiple Imputation – Understanding the Formula for Between-Imputation Variance

Tags: multiple-imputation, standard-error, variance

Multiple imputation (MI) is used to estimate a desired statistic when some of the data are missing.

This follows ^Shafer (page 4) and ^Austin et al. (section "Analyses in the M imputed data sets"), both of which give a primer on Rubin's book on MI.

Let

  • $Y$ be our data, and break it into $Y_{obs}$ (observed/existing), $Y_{mis}$ (missing subset).
  • $Q(Y)$ be the variable of interest, to be computed from our data.
  • For everything below to hold, we need $Q(Y)$ to be a normal random variable.

Method:

  • Find a way to simulate $Y_{mis}$, and generate $\{Y_{mis}^{(i)}\}_{i = 1}^m$, $m$ samples.
  • Compute $Q^i := Q(Y_{obs}, Y_{mis}^{(i)})$
  • Compute $\bar{Q} := \sum_i Q^i / m$. This is our estimate!
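The three steps can be sketched as a toy example. This is only an illustration under assumptions that are not in the papers: $Q$ is taken to be the sample mean, and `impute_and_estimate` is a made-up helper that draws each missing value from a normal distribution fitted to the observed values (a proper imputation model would also draw the fitted parameters).

```python
import random
import statistics

def impute_and_estimate(y_obs, n_mis, m, seed=0):
    """Return the m completed-data estimates Q^i and their average Q-bar.

    Assumption: Q is the sample mean and missing values are simulated
    from a normal fitted to y_obs (a deliberately simple stand-in model).
    """
    rng = random.Random(seed)
    mu = statistics.fmean(y_obs)
    sd = statistics.stdev(y_obs)
    q = []
    for _ in range(m):
        # Simulate one version of Y_mis^(i).
        y_mis = [rng.gauss(mu, sd) for _ in range(n_mis)]
        # Q^i is computed on the completed data (observed + imputed).
        q.append(statistics.fmean(y_obs + y_mis))
    q_bar = statistics.fmean(q)  # the pooled point estimate
    return q, q_bar

q, q_bar = impute_and_estimate([4.1, 5.0, 5.6, 4.7, 5.3], n_mis=3, m=20)
```

The seed is fixed only so that repeated runs give the same imputations.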

We can go further and get t statistic confidence intervals/hypothesis tests on how well $\bar{Q}$ approximates $Q$.

For this we need the variance of $Q$:

  1. Recall Total Variance formula: $var(Q) = E_Y( var(Q|Y) ) + var_Y( E(Q|Y))$.
  2. We need a formula for $U := var(Q|Y)$.
  3. Compute $U^{i}:= var(Q^i|Y^i)$
  4. Estimate $E_Y( var(Q|Y) )$ by computing $\bar{U}:= \sum U^i/m$.
    • This is the first term of the total variance formula
    • This is called average within-imputation variance
  5. $\bar{Q}:= \sum(Q^i)/m$, our estimate of $E(Q|Y)$, is already computed.
  6. Compute $B:= (m-1)^{-1}\sum (Q^i-\bar{Q})^2$ to estimate $var( \bar{Q})$
    • This is called between-imputation variance
  7. Finally, the desired estimate of $var(Q)$ is $T := (1+(1/m))B + \bar{U}$
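The steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from either paper: the function name `pool` and the input numbers are made up, and the $U^i$ are assumed to be supplied by whatever analysis produced each $Q^i$.

```python
import statistics

def pool(q, u):
    """Combine m completed-data estimates q[i] and their variances u[i]."""
    m = len(q)
    q_bar = statistics.fmean(q)      # pooled point estimate
    u_bar = statistics.fmean(u)      # average within-imputation variance
    b = statistics.variance(q)       # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b      # total variance T
    return q_bar, u_bar, b, t

# Hypothetical estimates and variances from m = 5 imputed data sets.
q_bar, u_bar, b, t = pool([10.0, 12.0, 11.0, 13.0, 9.0],
                          [4.0, 5.0, 4.5, 5.5, 4.0])
```

For these inputs, $\bar{Q} = 11$, $\bar{U} = 4.6$, $B = 2.5$ and $T = 4.6 + 1.2 \times 2.5 = 7.6$.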

Questions here!

Questions on step 6:

  • Isn't $B$ the sample variance of $\{Q^i\}$, which should estimate $var(Q(Y)|Y)$, which is just $U$?
  • I think what we really want when we say $var_Y( E(Q|Y) )$ is the square of the standard error of the mean, which is $B/m$. This follows ^StandardError:

The standard error [of the mean] is, by definition, the standard deviation of $ \bar{x}$ which is simply the square root of the variance:
$\sigma_{\bar {x}} = \sqrt { \frac{\sigma^{2}}{n} } = \frac{\sigma}{ \sqrt{n}}$

When focusing on a single statistic, the Monte Carlo error can be computed as $\sqrt{B/M}$.

So this seems to confirm my logic above. If so, then is the between-imputation variance $B$, $B/m$, or $(1+1/m)B$?

Questions on step 7:

  • It seems to me that the total variance formula says $var(Q) = B/m + \bar{U}$. Am I right?
  • If so, why does the formula in these papers multiply $B$ by $(1+1/m)$ and use $var(Q) = B + B/m + \bar{U}$?

The papers go on to say we can construct Student t confidence intervals:

  1. $(Q-\bar{Q})/\sqrt{T} \sim $ Student t distribution with $\nu = (m-1)[1 + \frac{\bar{U}}{(1+(1/m))B}]^2$ degrees of freedom ($T$ is a variance, so the standard error is $\sqrt{T}$).
  2. This means $\bar{Q} \pm A\sqrt{T}$ is a level-$k$ confidence interval if $A$ is the two-sided t-value at $(1-k, \nu)$.
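As a rough sketch of the interval above, the following computes $\nu$ and the interval endpoints, taking $T$ to be a variance so that the standard error is $\sqrt{T}$. The function names and input numbers are illustrative assumptions, and the t critical value is passed in by the caller (from a table or a statistics library) rather than computed here.

```python
import math

def rubin_df(u_bar, b, m):
    """Degrees of freedom nu = (m - 1) * [1 + u_bar / ((1 + 1/m) * b)]^2."""
    return (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2

def mi_interval(q_bar, t_total, t_crit):
    """Return q_bar -/+ t_crit * sqrt(T), where t_total is the total variance T."""
    half_width = t_crit * math.sqrt(t_total)
    return q_bar - half_width, q_bar + half_width

# Hypothetical inputs: m = 5, U-bar = 4.6, B = 2.5, so T = 4.6 + 1.2 * 2.5 = 7.6.
nu = rubin_df(u_bar=4.6, b=2.5, m=5)
# 2.06 is an approximate 97.5% t quantile for about 26 degrees of freedom.
low, high = mi_interval(q_bar=11.0, t_total=7.6, t_crit=2.06)
```

With these made-up numbers $\nu$ comes out near 25.7. The formula shows that $\nu$ depends on the ratio $\bar{U}/B$ as well as on $m$.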

Question: If $m$ is sufficiently large (unsure, maybe 10K?), couldn't we use a z-score/normal distribution for our tests, since $Q(Y)$ is normal?

General Question
Can you check my understanding of why we cannot just use $\sum(Q^i - \bar{Q})^2/(m-1)$ as the estimate of the variance of $Q$? If $Y_{mis}$ doesn't affect $Q$ at all, then this computation would give a variance of 0. We need $U$ to account for the variance that comes from computing $Q$ on $Y_{obs}$. Is that right?
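To make that scenario concrete, here is a small sketch under an assumption of my own: $Q$ is the mean of a fully observed variable, so the imputed values never enter it, and $U^i$ is the usual variance of a sample mean. The data values are invented.

```python
import statistics

y_obs = [4.1, 5.0, 5.6, 4.7, 5.3]  # fully observed variable (hypothetical values)
m = 10

# Q ignores the imputed values, so every completed data set gives the same Q^i.
q = [statistics.fmean(y_obs) for _ in range(m)]
# Within-imputation variance of a sample mean: s^2 / n, identical in every data set.
u = [statistics.variance(y_obs) / len(y_obs) for _ in range(m)]

b = statistics.variance(q)       # between-imputation variance
u_bar = statistics.fmean(u)      # average within-imputation variance
t = u_bar + (1 + 1 / m) * b      # total variance
```

Here `b` is zero because all the $Q^i$ coincide, while `u_bar` is positive, so `t` reduces to `u_bar`: all of the uncertainty comes from the within-imputation term.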

Best Answer

The total variance formula you would like to use is based on an infinitely large number of imputations, with $\bar U_{\infty}$ and $B_{\infty}$ the within- and between-imputation variance. The estimate of the within-imputation variance $\bar U$ based on a finite number $m$ of imputations is unbiased, but there's a bias if you substitute the finite-imputation $B$ for $B_{\infty}$. Quoting Section 2.3.2 of van Buuren's Flexible Imputation of Missing Data:

It is tempting to conclude that the total variance $T$ is equal to the sum of $\bar U$ and $B$, but that would be incorrect. We need to incorporate the fact that $\bar Q$ itself is estimated using finite $m$, and thus only approximates $Q_{\infty}$. Rubin shows that the contribution to the variance of this factor is systematic and equal to $B_{\infty}/m$.

With $B$ providing an approximation to $B_{\infty}$, you get the third contribution to the total variance, $B/m$, which is the term you are questioning.
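Written out term by term, as a restatement of the quoted argument (the labels under the braces are shorthand, not van Buuren's notation):

$$T = \underbrace{\bar U}_{\text{within}} + \underbrace{B}_{\text{between}} + \underbrace{\frac{B}{m}}_{\text{finite } m} = \bar U + \left(1 + \frac{1}{m}\right) B$$

So $B$ by itself is the between-imputation variance, $B/m$ is the extra variance from using only $m$ imputations to compute $\bar Q$, and that last term vanishes as $m \to \infty$. With illustrative values $\bar U = 4.6$, $B = 2.5$ and $m = 5$, this gives $T = 4.6 + 2.5 + 0.5 = 7.6$.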