# Multiple Imputation (MI) for estimating a desired statistic with missing data

This follows ^Shafer (page 4) and ^Austin et al. (section "Analyses in the M imputed data sets"), which give a primer on Rubin's book on MI.

Let

- $Y$ be our data, and break it into $Y_{obs}$ (observed/existing), $Y_{mis}$ (missing subset).
- $Q(Y)$ be the variable of interest, to be computed from our data.
- We need $Q(Y)$ to be a normal random variable for all of the following to hold.

Method:

- Find a way to simulate $Y_{mis}$, and generate $m$ samples $\{Y_{mis}^{(i)}\}_{i = 1}^m$.
- Compute $Q^i := Q(Y_{obs}, Y_{mis}^{(i)})$ for each $i$.
- Compute $\bar{Q} := \sum_i Q^i / m$. *This is our estimate!*
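To make the three steps concrete, here is a minimal Python sketch. The imputation model (drawing each missing value from a normal fitted to the observed values) is purely an assumption for illustration and is not a proper imputation model, since it ignores parameter uncertainty; the names `impute_once` and `mi_point_estimate` are hypothetical, and the data are made up.

```python
import random
import statistics

def impute_once(y, rng):
    """Fill each missing entry (None) with a draw from a normal fitted to the observed values.

    Assumption for illustration only: a real imputation model would also
    propagate uncertainty about the fitted mean and standard deviation.
    """
    observed = [v for v in y if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mu, sigma) for v in y]

def mi_point_estimate(y, q, m, seed=0):
    """Return (Q_bar, [Q^1, ..., Q^m]) for the statistic q over m completed data sets."""
    rng = random.Random(seed)
    q_values = [q(impute_once(y, rng)) for _ in range(m)]
    return sum(q_values) / m, q_values

# Toy data: None marks a missing value; Q is the sample mean.
y = [2.1, None, 3.4, 2.9, None, 3.8, 2.5]
q_bar, q_values = mi_point_estimate(y, statistics.mean, m=20)
```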

We can go further and get *t* statistic confidence intervals/hypothesis tests on how well $\bar{Q}$ approximates $Q$.

Required is the variance of $Q$:

- Recall the total variance formula: $var(Q) = E_Y( var(Q|Y) ) + var_Y( E(Q|Y))$.
- We have to *know* the formula for $U := var(Q|Y)$.
- Compute $U^{i} := var(Q^i|Y^i)$.
- Estimate $E_Y( var(Q|Y) )$ by computing $\bar{U} := \sum_i U^i/m$. *This is the first term of the total variance formula.* This is called the *average within-imputation variance*.
- $\bar{Q} := \sum_i Q^i/m$, our estimate of $E(Q|Y)$, is already computed.
- Compute $B := (m-1)^{-1}\sum_i (Q^i-\bar{Q})^2$ to estimate $var(\bar{Q})$. This is called the *between-imputation variance*.
- Finally, the desired estimate for $Var(Q)$ is $T := (1+(1/m))B + \bar{U}$.
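The pooling arithmetic above can be sketched in a few lines of Python. The function name `pool_rubin` and the input numbers are made up for illustration; the formulas are the ones listed above.

```python
import statistics

def pool_rubin(q_values, u_values):
    """Pool m per-imputation estimates Q^i and their variances U^i.

    Returns (Q_bar, U_bar, B, T) as defined in the list above.
    """
    m = len(q_values)
    q_bar = sum(q_values) / m            # pooled point estimate
    u_bar = sum(u_values) / m            # average within-imputation variance
    b = statistics.variance(q_values)    # between-imputation variance, (m - 1) denominator
    t = u_bar + (1 + 1 / m) * b          # total variance
    return q_bar, u_bar, b, t

# Hypothetical values for m = 5 imputations.
q_values = [10.0, 12.0, 11.0, 13.0, 9.0]
u_values = [4.0, 5.0, 4.5, 5.5, 4.0]
q_bar, u_bar, b, t = pool_rubin(q_values, u_values)
# Here Q_bar = 11.0, U_bar = 4.6, B = 2.5, T = 4.6 + 1.2 * 2.5 = 7.6
```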

**Questions here!**

**Questions on step 5 (the between-imputation variance $B$):**

- Isn't $B$ the sample variance of $\{Q^i\}$, which should estimate $var(Q(Y)|Y)$, which is just $U$?
- I think what we really want when we say $var_Y( E(Q|Y))$ is the *square of the standard error of the mean*, which is $B/m$. This follows ^StandardError:

> The standard error [of the mean] is, by definition, the standard deviation of $\bar{x}$ which is simply the square root of the variance:
>
> $\sigma_{\bar {x}} = \sqrt { \frac{\sigma^{2}}{n} } = \frac{\sigma}{ \sqrt{n}}$

- ^Austin et al. go on to say:

> When focusing on a single statistic, the Monte Carlo error can be computed as $\sqrt{B/M}$.

So this seems to confirm my logic above (Austin et al.'s $M$ is the number of imputations, $m$ here). If so, is the *between-imputation variance* $B$, $B/m$, or $(1+1/m)B$?

**Questions on step 6 (the total variance $T$):**

- It seems to me that the total variance formula says $var(Q) = B/m + \bar{U}$. Am I right?
- If so, why does the formula in these papers multiply $B$ by $(1+1/m)$ and use $var(Q) = B + B/m + \bar{U}$?

The papers go on to say we can construct Student's *t* confidence intervals:

- $(Q-\bar{Q})/\sqrt{T} \sim$ Student's *t* with $\nu = (m-1)\left[1 + \frac{\bar{U}}{(1+(1/m))B}\right]^2$ degrees of freedom ($T$ is a variance, so the scale is $\sqrt{T}$).
- This means $\bar{Q} \pm A\sqrt{T}$ is a level-$k$ confidence interval if $A$ is the two-sided t-value at $(1-k, \nu)$.
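A rough sketch of the degrees-of-freedom and interval computation follows, treating $T$ as a variance so that the standard error is $\sqrt{T}$. The function names are hypothetical, the numbers are made up, and the critical value `t_crit` is passed in by hand as an approximate value rather than computed; a library routine such as `scipy.stats.t.ppf` could supply it. Note that the formula for $\nu$ is undefined when $B = 0$.

```python
import math

def rubin_df(u_bar, b, m):
    """Degrees of freedom nu = (m - 1) * [1 + U_bar / ((1 + 1/m) * B)]^2 (requires B > 0)."""
    return (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2

def mi_interval(q_bar, t_total, t_crit):
    """Interval Q_bar +/- t_crit * sqrt(T), where T is the total variance."""
    half_width = t_crit * math.sqrt(t_total)
    return q_bar - half_width, q_bar + half_width

# Hypothetical inputs: m = 5, U_bar = 1.0, B = 0.5, so T = 1.0 + 1.2 * 0.5 = 1.6.
nu = rubin_df(u_bar=1.0, b=0.5, m=5)   # about 28.4
# t_crit = 2.05 is an assumed, approximate 97.5% t quantile for nu near 28.
lower, upper = mi_interval(q_bar=11.0, t_total=1.6, t_crit=2.05)
```

For fixed $\bar{U}$ and $B$, $\nu$ grows with $m$, which is why the *t* reference distribution moves toward the normal as the number of imputations increases.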

Question: If $m$ is sufficiently large (unsure, maybe 10K?), couldn't we use a z-score/normal distribution for our tests, since $Q(Y)$ is normal?

**General Question**

Can you check my understanding of why we cannot just use $\sum_i (Q^i - \bar{Q})^2/(m-1)$ as the estimate of the variance of $Q$? If $Y_{mis}$ doesn't affect $Q$ at all, then we'd get a variance of 0 with this computation. We need $U$ to account for the variance that comes from $Q$ on $Y_{obs}$. Is that right?
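The edge case in this question can be checked numerically. In the sketch below, $Q$ is assumed to be the mean of a fully observed column, so every imputed data set gives the same $Q^i$; the data are made up, and $U^i$ uses the usual variance-of-the-mean formula $s^2/n$.

```python
import statistics

# Q depends only on the observed values, so imputations cannot change it.
y_obs = [2.1, 3.4, 2.9, 3.8, 2.5]
m = 5

q_values = [statistics.mean(y_obs)] * m                    # identical Q^i
u_values = [statistics.variance(y_obs) / len(y_obs)] * m   # U^i = s^2 / n

b = statistics.variance(q_values)    # between-imputation variance: zero here
u_bar = sum(u_values) / m            # within-imputation variance: positive
t = u_bar + (1 + 1 / m) * b          # total variance reduces to U_bar
```

With $B = 0$, the between-imputation term alone would report no uncertainty, while $T$ still carries the ordinary sampling variance through $\bar{U}$.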

## Best Answer

The total variance formula you would like to use is based on an infinitely large number of imputations, with $\bar U_{\infty}$ and $B_{\infty}$ the within- and between-imputation variance. The estimate of the within-imputation variance $\bar U$ based on a finite number $m$ of imputations is unbiased, but there's a bias if you substitute the finite-imputation $B$ for $B_{\infty}$. Quoting Section 2.3.2 of van Buuren's Flexible Imputation of Missing Data:

With $B$ providing an approximation to $B_{\infty}$, you get the 3rd contribution to the total variance, $B/m$, that you are questioning.
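Written out in the question's notation, and assuming the three-part decomposition described in van Buuren's Section 2.3.2 (sampling variance, extra variance due to missing data, and simulation variance from using a finite $m$), the total variance is:

$T = \bar{U} + B + \frac{B}{m} = \bar{U} + \left(1 + \frac{1}{m}\right) B$

Here $B/m$ is the variance of $\bar{Q}$ around the value it would take with infinitely many imputations, so it vanishes as $m \to \infty$ and the formula reduces to $\bar{U}_{\infty} + B_{\infty}$.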