Pooled typically refers to a "weighted" average. If you have two samples and the estimates of each sample's variance are $s_1^2$ and $s_2^2$, you might consider the pooled estimate:
$$
s^2 = \dfrac{ (n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$$
Note that this is not a simple average, which would be $$\dfrac{s_1^2+s_2^2}{2}$$
The idea is that each sample might be based on a different sample size and you want to account for that in your estimate (the estimate that comes from the larger sample size should have more of an impact on your final estimate than the estimate from the smaller sample size).
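This weighting can be sketched in a few lines of Python (the function and variable names are illustrative, not from any package):

```python
def pooled_variance(s1_sq, s2_sq, n1, n2):
    """Weighted pooled estimate of a common variance from two samples."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# With very unequal sample sizes, the pooled estimate is pulled toward the
# variance from the larger sample, while the simple average ignores n1, n2:
pooled = pooled_variance(4.0, 9.0, n1=101, n2=11)  # close to 4.0
simple = (4.0 + 9.0) / 2                           # exactly 6.5
```

With $n_1 = 101$ and $n_2 = 11$, the pooled estimate sits near the first sample's variance, whereas the simple average is halfway between the two.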
Rubin's rules can only be applied to parameters following a normal distribution. For parameters with an F or chi-square distribution, a different set of formulas is needed:
- Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.
For performing an ANOVA on multiply imputed datasets you could use the R package miceadds (see miceadds::mi.anova).
Update 1
Here is a complete example:
Export your data from SPSS to R. In SPSS, save your dataset as .csv.
Read in your dataset:
library(miceadds)
dat <- read.csv(file='your-dataset.csv')
Let's assume that reading is your dependent variable and that you have two factors:
- gender, with male = 0 and female = 1
- treatment, with control = 0 and 'received treatment' = 1
Now let's convert them to factors:
dat$gender <- factor(dat$gender)
dat$treatment <- factor(dat$treatment)
Convert your dataset to a mids object, where we assume that the first variable holds the imputation number (Imputation_ in SPSS):
dat.mids <- as.mids(dat)
Now you can perform an ANOVA:
fit <- mi.anova(mi.res=dat.mids, formula="reading~gender*treatment", type=3)
summary(fit)
Update 2
This is a reply to your second comment:
What you describe here is a data import/export problem between SPSS and R. You could try to import the .sav file directly into R; there are a bunch of dedicated packages for that: foreign, rio, gdata, Hmisc, etc. I prefer the csv way, but that's a matter of taste and/or depends on the nature of your problem. Maybe you should also check some tutorials on YouTube or other sources on the internet.
library(foreign)
dat <- read.spss(file='path-to-sav', use.value.labels=F, to.data.frame=T)
Update 3
This is a reply to your first comment:
Yes, you can do your analysis in SPSS and pool the F values in miceadds (please note this example is taken from the miceadds::micombine.F help page):
library(miceadds)
Fvalues <- c(6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78,
6.76 , 4.54 , 4.23 , 5.45 , 4.78, 6.76 , 4.54 , 4.23 , 5.45 , 4.78 )
micombine.F(Fvalues, df1=4)
Best Answer
What you write might come close in some circumstances but can't be counted on in general. For example, putting multiple imputation aside for a moment, an average of hazard-ratio confidence intervals from Cox survival regression models among bootstrapped samples from a complete data set will tend to be very poorly behaved.
For multiple imputation, Section 2.3 of Stef van Buuren's Flexible Imputation of Missing Data explains that Rubin's Rules take not only within-imputation and between-imputation variances into account but also a further variance due to a finite number of imputations. The variance of an averaged statistic $\bar Q$ among $m$ imputations thus has three sources:
$$T = \bar U + B + \frac{B}{m},$$
where $\bar U$ is the average within-imputation variance, $B$ is the between-imputation variance, and $B/m$ is the extra variance from using only a finite number $m$ of imputations.
What you write seems to be most closely related to the variance contributed by $\bar U$, although it might include some contribution from $B$ insofar as the mean-value estimates change among imputation sets and thus shift the CI. It doesn't seem, however, to include the extra variance due to a finite value of $m$. If you had a large number $m$ of imputations that might not be a big problem.
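To make the three variance sources concrete, Rubin's combination of within- and between-imputation variance can be sketched in a few lines of Python (the per-imputation estimates and squared standard errors below are made up for illustration):

```python
import statistics

# Hypothetical point estimates and squared standard errors
# from m = 5 imputed datasets (illustrative numbers only).
estimates = [2.1, 2.4, 2.0, 2.3, 2.2]
variances = [0.25, 0.30, 0.28, 0.26, 0.27]

m = len(estimates)
q_bar = sum(estimates) / m           # pooled point estimate
u_bar = sum(variances) / m           # within-imputation variance
b = statistics.variance(estimates)   # between-imputation variance
t = u_bar + b + b / m                # total variance, incl. finite-m term
se_pooled = t ** 0.5                 # the pooled SE to report
```

Averaging the per-imputation confidence intervals would capture roughly $\bar U$ (and indirectly some of $B$), but never the $B/m$ term, which is why the pooled SE from $T$ is the safe choice.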
So it's safest to follow Rubin's Rules and stick with the pooled SE.