Multiple Imputation – Understanding the Formula for Between-Imputation Variance

multiple-imputation, standard-error, variance

Multiple Imputation (MI) for estimating a desired statistic when some of the data are missing

The following is based on ^Shafer (page 4) and ^Austin et al. (section "Analyses in the M imputed data sets"), both of which give a primer on Rubin's book on MI.

Let

$Y$ be our data, and break it into $Y_{obs}$ (observed/existing), $Y_{mis}$ (missing subset).
$Q(Y)$ be the variable of interest, to be computed from our data.
For everything below to hold, $Q(Y)$ must be a normal random variable.

Method:

Find a way to simulate $Y_{mis}$, and generate $\{Y_{mis}^{(i)}\}_{i = 1}^m$, $m$ samples.
Compute $Q^i := Q(Y_{obs}, Y_{mis}^{(i)})$
Compute $\bar{Q} := \sum_i Q^i / m$. This is our estimate!

We can go further and get t statistic confidence intervals/hypothesis tests on how well $\bar{Q}$ approximates $Q$.

Required is the variance of $Q$:

Recall Total Variance formula: $var(Q) = E_Y( var(Q|Y) ) + var_Y( E(Q|Y))$.
We need to know the formula for $U := var(Q|Y)$.
Compute $U^{i}:= var(Q^i|Y^i)$
Estimate $E_Y( var(Q|Y) )$ by computing $\bar{U}:= \sum U^i/m$.
- This is the first term of the total variance formula
- This is called average within-imputation variance
$\bar{Q}:= \sum(Q^i)/m $, our estimate of $E(Q|Y)$ is already computed.
Compute $B:= (m-1)^{-1}\sum (Q^i-\bar{Q})^2$ to estimate $var( \bar{Q})$
- This is called between-imputation variance
Finally, the desired estimate for $T:= Var(Q) = (1+(1/m))B + \bar{U}$
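As a concrete check on the steps above, here is a minimal R sketch of the pooling. The helper name `pool_rubin` and the input numbers are made up for illustration; in practice `q` and `u` would come from running the same analysis on each of the $m$ completed data sets.

```r
# Hypothetical helper: pool m completed-data results.
# q: point estimates Q^i, u: their estimated variances U^i.
pool_rubin <- function(q, u) {
  m     <- length(q)
  q_bar <- mean(q)                  # pooled point estimate
  u_bar <- mean(u)                  # average within-imputation variance
  b     <- var(q)                   # between-imputation variance (divisor m - 1)
  t_var <- u_bar + (1 + 1 / m) * b  # total variance T
  list(m = m, q_bar = q_bar, u_bar = u_bar, b = b, t_var = t_var)
}

# Made-up results standing in for m = 5 completed-data analyses
q <- c(10.2, 9.8, 10.5, 10.1, 9.9)
u <- c(0.40, 0.42, 0.38, 0.41, 0.39)
res <- pool_rubin(q, u)
print(unlist(res))
```

With these inputs, $\bar{Q} = 10.1$, $\bar{U} = 0.40$, $B = 0.075$ and $T = 0.40 + 1.2 \times 0.075 = 0.49$.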

Questions here!

Questions on step 5:

Isn't $B$ the sample variance for $\{Q^i\}$, which should estimate $var(Q(Y)|Y)$ which is just $U$?
I think what we really want when we say $var_Y( E(Q|Y) )$ is the square of the standard error of the mean, which is $B/m$. This follows ^StandardError:

The standard error [of the mean] is, by definition, the standard deviation of $ \bar{x}$ which is simply the square root of the variance:
$\sigma_{\bar {x}} = \sqrt { \frac{\sigma^{2}}{n} } = \frac{\sigma}{ \sqrt{n}}$

^Austin et al. go on to say

When focusing on a single statistic, the Monte Carlo error can be computed as $\sqrt{B/M}$.

So this seems to confirm my logic above (their $M$ is my $m$). If so, is the between-imputation variance $B$, $B/m$, or $(1+1/m)B$?

Questions on step 6:

It seems to me that the total variance formula says $var(Q) = B/m + \bar{U}$. Am I right?
If so, why does the formula in these papers multiply $B$ by $(1+1/m)$ and use $var(Q) = B + B/m + \bar{U}$?

The papers go on to say we can build Student t confidence intervals:

$(Q-\bar{Q})/\sqrt{T} \sim $ Student t distribution with $\nu = (m-1)[1 + \frac{\bar{U}}{(1+(1/m))B}]^2$ degrees of freedom (since $T$ is a variance, the standard error is $\sqrt{T}$).
This means $\bar{Q} \pm A\sqrt{T}$ is a $k$ confidence interval if $A$ is the two-sided t-value at $(1-k, \nu)$.
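A small R sketch of this interval, using made-up values for $Q^i$ and $U^i$ and assuming the usual convention that the standard error is $\sqrt{T}$ because $T$ is a variance:

```r
# Made-up completed-data estimates and variances (m = 5)
q <- c(10.2, 9.8, 10.5, 10.1, 9.9)
u <- c(0.40, 0.42, 0.38, 0.41, 0.39)

m     <- length(q)
q_bar <- mean(q)
u_bar <- mean(u)
b     <- var(q)
t_var <- u_bar + (1 + 1 / m) * b

# Degrees of freedom from the formula quoted above
nu <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2

# Two-sided 95% interval: q_bar +/- A * sqrt(T)
a  <- qt(0.975, df = nu)
ci <- q_bar + c(-1, 1) * a * sqrt(t_var)

print(c(nu = nu, t_quantile = a, z_quantile = qnorm(0.975)))
print(ci)
```

Here $\nu = 4 \times (1 + 0.4/0.09)^2 \approx 118.6$, so the t quantile is already close to the normal quantile. $\nu$ grows with $m$ and with the ratio $\bar{U}/B$, not only with $m$.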

Question: If $m$ is sufficiently large (unsure, maybe 10K?), couldn't we use a z-score/Normal for our tests, since $Q(Y)$ is normal?

General Question
Can you check my understanding of why we cannot just use $\sum(Q^i - \bar{Q})^2/(m-1)$ as the estimate of the variance of $Q$? If $Y_{mis}$ doesn't affect $Q$ at all, then this computation would give a variance of 0. We need $U$ to account for the variance that comes from $Q$ on $Y_{obs}$. Is that right?

Best Answer

The total variance formula you would like to use is based on an infinitely large number of imputations, with $\bar U_{\infty}$ and $B_{\infty}$ the within- and between-imputation variance. The estimate of the within-imputation variance $\bar U$ based on a finite number $m$ of imputations is unbiased, but there's a bias if you substitute the finite-imputation $B$ for $B_{\infty}$. Quoting Section 2.3.2 of van Buuren's Flexible Imputation of Missing Data:

It is tempting to conclude that the total variance $T$ is equal to the sum of $\bar U$ and $B$, but that would be incorrect. We need to incorporate the fact that $\bar Q$ itself is estimated using finite $m$, and thus only approximates $Q_{\infty}$. Rubin shows that the contribution to the variance of this factor is systematic and equal to $B_{\infty}/m$.

With $B$ providing an approximation to $B_{\infty}$, you get the 3rd contribution to the total variance, $B/m$, that you are questioning.
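Written out, the argument runs roughly as follows. This is a sketch in the notation of the quote, where $\bar Q_m$ (a symbol introduced here) denotes the pooled estimate from $m$ imputations:

$$
\begin{aligned}
T_{\infty} &= \bar U_{\infty} + B_{\infty} && \text{(infinitely many imputations)} \\
var(\bar Q_m - Q_{\infty}) &= \frac{B_{\infty}}{m} && \text{(mean of } m \text{ draws, each with variance } B_{\infty}\text{)} \\
T_m &= \bar U_{\infty} + B_{\infty} + \frac{B_{\infty}}{m} \approx \bar U + \left(1 + \frac{1}{m}\right) B
\end{aligned}
$$

So $B$ estimates the between-imputation variance itself, and $B/m$ is the extra Monte Carlo error that comes from using finitely many imputations. That extra term is what $\sqrt{B/M}$ in Austin et al. measures.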

Related Solutions

Solved – Variance of the Poisson Binomial Distribution

Think of the case where $n=2$. If $p_1 = p_2 = 0.5$, this maximizes the variance of X. If $p_1 = 0$ and $p_2 = 1$, then $X=1$ and there is no variance.
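The two cases can be checked numerically. The sketch below assumes independent Bernoulli trials, so that $Var(X) = \sum_i p_i(1-p_i)$; the function name `pb_var` is made up for illustration.

```r
# Variance of a Poisson binomial count, assuming independent trials
pb_var <- function(p) sum(p * (1 - p))

v_equal   <- pb_var(c(0.5, 0.5))  # both trials are fair coins
v_extreme <- pb_var(c(0, 1))      # X is always exactly 1

print(c(equal = v_equal, extreme = v_extreme))
```

The first case gives $0.25 + 0.25 = 0.5$, and the second gives $0$.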

Confidence Interval – How to Get Confidence Interval for Population Variance

Let's take a look at a bootstrap based approach, and compare the results to the CLT based confidence interval. First, I'll define a population distribution which is trimodal, skewed, and heavy tailed (compared to Normal).

rmydist <- function(n){
  i <- sample(3, n, TRUE)
  x1 <- rnorm(n)
  x2 <- rgamma(n, 1.2, 0.5) +2
  x3 <- rbeta(n, 1.8, 0.5)*3 - 4
  x <- (i==1)*x1 + (i==2)*x2 + (i==3)*x3
  return(x)
}

# Plot histogram with huge n
x <- rmydist(1e8)
hist(x, breaks=30)

The "true" variance, based on this really huge sample from the population, is $\sigma^2 \approx 8.612$ (this matches the exact variance, which can be computed using the Law of Total Variance).
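For reference, the exact value can be reproduced with a few lines of R. This sketch reads the component means and variances off `rmydist` (equal mixture weights of 1/3; `rgamma(n, 1.2, 0.5)` is shape 1.2 and rate 0.5) and applies the Law of Total Variance.

```r
# Mixture weights: sample(3, n, TRUE) picks each component with probability 1/3
w <- rep(1 / 3, 3)

# Component means: N(0, 1), Gamma(shape 1.2, rate 0.5) + 2, 3 * Beta(1.8, 0.5) - 4
mu <- c(0,
        1.2 / 0.5 + 2,
        3 * 1.8 / (1.8 + 0.5) - 4)

# Component variances
s2 <- c(1,
        1.2 / 0.5^2,
        9 * (1.8 * 0.5) / ((1.8 + 0.5)^2 * (1.8 + 0.5 + 1)))

within  <- sum(w * s2)                     # E[var(X | component)]
between <- sum(w * (mu - sum(w * mu))^2)   # var(E[X | component])
sigma2  <- within + between

print(round(sigma2, 3))
```

This gives roughly $2.088 + 6.524 \approx 8.612$.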

Now we can compute 95% confidence intervals using (i) the percentile bootstrap, (ii) the accelerated bootstrap, (iii) the accelerated bootstrap on the log-sd scale (see @whuber's comment), and (iv) the chi-square approximation, all for $n=200$.

# Generate data
set.seed(12345)
n <- 200
x <- rmydist(n)

# Confidence interval with (percentile) bootstrap
B <- 10000
boot <- rep(NA, B)
for(i in 1:B){
  xnew <- sample(x, n, TRUE)
  boot[i] <- var(xnew)
}
CI_1 <- quantile(boot, probs=c(0.025, 0.975))

# Confidence interval with accelerated bootstrap
#devtools::install_github("knrumsey/quack")
library(quack)
a <- est_accel(x, var)
CI_2 <- quack::boot_accel(x, var, alpha=0.05, a=a)

# Confidence interval with accelerated bootstrap (log-scale)
logsd <- function(xx) log(sd(xx))
a <- est_accel(x, logsd)
tmp  <- quack::boot_accel(x, logsd, alpha=0.05, a=a)
CI_3 <- exp(tmp)^2

# Confidence interval based on the chi-square approximation (labelled "CLT" in the tables)
chi <- qchisq(c(0.025, 0.975), n-1)
CI_4 <- (n-1)*var(x)/rev(chi)

The confidence intervals for these four methods come out to be

Method	Lower bound	Upper bound
Bootstrap	$6.52$	$9.52$
Accelerated bootstrap	$6.74$	$9.76$
Accelerated boot (log-sd)	$6.74$	$9.80$
CLT	$6.68$	$9.91$

Note that all four methods capture the "true value" of $8.612$ here. But one dataset isn't very interesting, so let's perform a simulation study.

Simulation study

We can repeat the R analysis conducted above one thousand times for each of the four methods. We are interested in (i) empirical coverage (the proportion of simulated datasets in which each method captures the true value) and (ii) the width of the confidence interval.

Edit: Thanks to @COOLSerdash, who extended this simulation study to include the methods discussed by Bonett (2006) and O'Neill (2014) (see also @Ben's answer). I have added these methods to the table below for ease of comparison.

Method	Empirical coverage	Interval width (average)
Percentile Bootstrap	91.3%	4.22
Accelerated bootstrap	93.0%	4.57
Accelerated boot. (log-sd)	93.0%	4.58
CLT	86.8%	3.44
Bonett (2006)	93.5%	4.64
O'Neill (2014)	91.9%	4.39

It is interesting to note that all of these methods undercover relative to the nominal 95% level, but the CLT-based method is especially over-confident: it yields narrow intervals that fail to capture the true value more often than they should.
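A reduced version of such a simulation study can be sketched in base R. It covers only the percentile bootstrap and the chi-square ("CLT") interval, and it uses far fewer replications and bootstrap draws than the study above so that it runs quickly. The replication counts, seed, and variable names are assumptions made here, so the results will not match the table exactly.

```r
# Same population as above, written with ifelse() for compactness
rmydist <- function(n) {
  i <- sample(3, n, TRUE)
  ifelse(i == 1, rnorm(n),
  ifelse(i == 2, rgamma(n, 1.2, 0.5) + 2,
                 rbeta(n, 1.8, 0.5) * 3 - 4))
}

sigma2_true <- 8.612   # "true" variance quoted above
set.seed(1)
n      <- 200          # sample size per dataset
n_rep  <- 200          # simulated datasets (the study above used 1000)
n_boot <- 500          # bootstrap resamples per dataset (10000 above)

methods <- c("percentile", "chisq")
covered <- matrix(FALSE, n_rep, 2, dimnames = list(NULL, methods))
width   <- matrix(NA_real_, n_rep, 2, dimnames = list(NULL, methods))

for (r in seq_len(n_rep)) {
  x <- rmydist(n)

  # Percentile bootstrap interval for the variance
  boot <- replicate(n_boot, var(sample(x, n, TRUE)))
  ci_b <- unname(quantile(boot, probs = c(0.025, 0.975)))

  # Chi-square interval (exact only for normal data)
  ci_c <- (n - 1) * var(x) / rev(qchisq(c(0.025, 0.975), n - 1))

  covered[r, ] <- c(ci_b[1] <= sigma2_true && sigma2_true <= ci_b[2],
                    ci_c[1] <= sigma2_true && sigma2_true <= ci_c[2])
  width[r, ]   <- c(diff(ci_b), diff(ci_c))
}

coverage  <- colMeans(covered)  # empirical coverage per method
avg_width <- colMeans(width)    # average interval width per method
print(rbind(coverage, avg_width))
```

With only 200 replications the coverage estimates are noisy (a standard error of roughly 2 percentage points), so small differences between methods should not be over-read.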