Why is this multiple imputation of low quality?

Tags: data-imputation, multiple-imputation, r

Consider the following R code:

> data <- data.frame(
+   a = c(NA, 2, 3, 4, 5, 6),
+   b = c(2.2, NA, 6.1, 8.3, 10.2, 12.13),
+   c = c(4.2, 7.9, NA, 16.1, 19.9, 23)
+ )
> data
   a     b    c
1 NA  2.20  4.2
2  2    NA  7.9
3  3  6.10   NA
4  4  8.30 16.1
5  5 10.20 19.9
6  6 12.13 23.0

As you can see, I've engineered the data so that roughly c = 2*b = 4*a. As such, I would expect the missing values to be around a=1, b=4, c=12. So I performed the analysis:

> imp <- mi(data)
Beginning Multiple Imputation ( Sat Oct 18 03:02:41 2014 ):
Iteration 1 
 Chain 1 : a*  b*  c*  
 Chain 2 : a*  b*  c*  
 Chain 3 : a*  b*  c*  
Iteration 2 
 Chain 1 : a*  b   c   
 Chain 2 : a*  b*  c*  
 Chain 3 : a   b*  c   
Iteration 3 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a*  b*  c*  
Iteration 4 
 Chain 1 : a   b   c   
 Chain 2 : a   b*  c   
 Chain 3 : a*  b   c   
Iteration 5 
 Chain 1 : a   b   c*  
 Chain 2 : a   b*  c   
 Chain 3 : a   b*  c   
Iteration 6 
 Chain 1 : a*  b   c*  
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 7 
 Chain 1 : a   b   c   
 Chain 2 : a   b*  c   
 Chain 3 : a   b   c*  
Iteration 8 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b*  c*  
Iteration 9 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c*  
 Chain 3 : a   b   c   
Iteration 10 
 Chain 1 : a   b*  c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 11 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 12 
 Chain 1 : a   b   c   
 Chain 2 : a*  b   c   
 Chain 3 : a   b   c   
Iteration 13 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c*  
 Chain 3 : a   b   c*  
Iteration 14 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 15 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c*  
Iteration 16 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b*  c   
Iteration 17 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 18 
 Chain 1 : a   b   c*  
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 19 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c*  
Iteration 20 
 Chain 1 : a   b   c*  
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 21 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 22 
 Chain 1 : a   b   c*  
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 23 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 24 
 Chain 1 : a   b   c*  
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 25 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 26 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 27 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 28 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 29 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
mi converged ( Sat Oct 18 03:02:45 2014 )
Run 20 more iterations to mitigate the influence of the noise...
Beginning Multiple Imputation ( Sat Oct 18 03:02:45 2014 ):
Iteration 1 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 2 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 3 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 4 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 5 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 6 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 7 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 8 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 9 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 10 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 11 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 12 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 13 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 14 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 15 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 16 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 17 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 18 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 19 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Iteration 20 
 Chain 1 : a   b   c   
 Chain 2 : a   b   c   
 Chain 3 : a   b   c   
Reached the maximum iteration, mi did not converge ( Sat Oct 18 03:02:48 2014 )

And finally observed the completed data set:

> mi.completed(imp)
[[1]]
  a     b    c
1 2  2.20  4.2
2 2  2.20  7.9
3 3  6.10 16.1
4 4  8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0

[[2]]
  a     b    c
1 2  2.20  4.2
2 2  6.10  7.9
3 3  6.10  7.9
4 4  8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0

[[3]]
  a     b    c
1 2  2.20  4.2
2 2  2.20  7.9
3 3  6.10  7.9
4 4  8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0

As you can see, the imputed values are not what I expected. In fact, they look like the result of single imputation: the missing values appear to have simply been copied from adjacent records.

What am I missing?

I should note that my "knowledge" of statistics is mostly limited to what I vaguely remember from an introductory course I took ~14 years ago. I'm just looking for a simple way to impute missing values; it doesn't have to be the most optimized one, but it does need to make some sort of sense (which I can't make of these results). It may well be that mi isn't the right tool for what I want (perhaps predict should be used instead), so I'm open to suggestions.
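To illustrate the predict route, here is a minimal sketch of what I have in mind (the fill() helper is hypothetical; it fits lm() on the complete rows and predicts each column's missing cells from the other two):

    fill <- function(d, target, predictors) {
      # fit on rows where the target and predictors are all observed
      fit <- lm(reformulate(predictors, response = target), data = d)
      miss <- is.na(d[[target]])
      d[[target]][miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
      d
    }
    data2 <- fill(data,  "a", c("b", "c"))
    data2 <- fill(data2, "b", c("a", "c"))
    data2 <- fill(data2, "c", c("a", "b"))
    data2  # expect roughly a = 1, b = 4, c = 12 in the missing cells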

I also tried a similar approach with mice, which led to similar results.

UPDATE: Amelia works great out of the box. It would still be interesting to know what I'm missing with mi/mice, though.
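For the record, the Amelia usage is essentially the following (a minimal sketch; amelia() fits a joint multivariate normal model, which suits data with linear relations like these):

    library(Amelia)
    a.out <- amelia(data, m = 3)   # draw 3 completed datasets
    a.out$imputations[[1]]         # inspect the first one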

Best Answer

Given that you are using six cases [records] and three variables, the quality of your imputation will be quite low.

To see why this will be the case, remember that multiple imputation works by filling in missing values with plausible imputed values. These values are calculated separately in each of $m$ datasets (I will return to how they are derived later in this answer), and they will vary slightly from dataset to dataset.

Thus, given a statistical quantity of interest $q$ (e.g., a mean, a regression coefficient, etc.), one can use the $m$ datasets to estimate the average standard error for $q$ within the $m$ datasets (a quantity I will call the within-imputation variance, or $\bar{U}$) and the degree to which $q$ varies across the $m$ datasets (a quantity I will call the between-imputation variance, or $B$).
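Concretely, these two quantities come from Rubin's combining rules: $\bar{U}$ is the average of the $m$ squared standard errors, and $B$ is the variance of the $m$ point estimates. A minimal sketch in R, assuming q_hat and se hold the per-dataset estimates of $q$ and their standard errors:

    # Rubin's rules for pooling m completed-data analyses
    pool_rubin <- function(q_hat, se) {
      m     <- length(q_hat)
      q_bar <- mean(q_hat)            # pooled point estimate
      U_bar <- mean(se^2)             # within-imputation variance
      B     <- var(q_hat)             # between-imputation variance
      T_var <- U_bar + (1 + 1/m) * B  # total variance of q_bar
      list(q = q_bar, U_bar = U_bar, B = B, T = T_var)
    }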

The relationship between imputation quality, $B$, and $\bar{U}$

One can use the within-imputation variance $\bar{U}$ and the between-imputation variance $B$ to estimate the degree to which an imputed estimate of a statistical quantity has been influenced by missing information. Of course, the more information that has been lost, the poorer the quality of the imputation. This estimate, often called the fraction of missing information, is labeled $\gamma$ and is given by the following formula (Rubin, 1996; Schafer, 1999):

$$\gamma = \frac{r + \frac{2}{df + 3}}{r + 1}$$

$r$ in this formula is a ratio of the between-imputation variance $B$ to the within-imputation variance $\bar{U}$:

$$r = \frac{(1 + \frac1m)B}{\bar{U}}$$

Thus, high values of $B$ result in high values of $r$, which in turn will result in high values of $\gamma$. A high value of $\gamma$, in turn, indicates more information lost due to missing data and a poorer quality imputation.

$df$ in the formula for $\gamma$ is also a function of $B$ and $\bar{U}$. Specifically, $df$ is estimated by

$$df = (m - 1)\left(1 + \frac{m\bar{U}}{(m + 1)B}\right)^2$$

Thus, in addition to increasing the ratio of between-imputation variance to within-imputation variance, increasing $B$ also decreases $df$. This will result in a higher value of $\gamma$, indicating more information lost to missingness and a poorer quality imputation.
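Putting the three formulas together, $\gamma$ is straightforward to compute from $\bar{U}$, $B$, and $m$; a small sketch with made-up inputs:

    # fraction of missing information, per the formulas above
    gamma_fmi <- function(U_bar, B, m) {
      r  <- (1 + 1/m) * B / U_bar                        # relative increase in variance
      df <- (m - 1) * (1 + m * U_bar / ((m + 1) * B))^2
      (r + 2 / (df + 3)) / (r + 1)
    }
    gamma_fmi(U_bar = 1, B = 0.1, m = 3)  # ~0.13: little information lost
    gamma_fmi(U_bar = 1, B = 10,  m = 3)  # ~0.96: most information lost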

In sum, higher values of the between-imputation variance $B$ affect imputation quality in two ways:

  1. Higher values of $B$ increase the ratio of the variance between imputations to the variance within imputations, decreasing imputation quality
  2. Higher values of $B$ decrease the available degrees of freedom, decreasing imputation quality

The relationship between the number of cases and $B$

Given two otherwise similar datasets, a dataset with a smaller number of cases will have a larger between-imputation variance $B$.

This occurs because, as I describe above, the between-imputation variance is computed by calculating a statistical quantity of interest $q$ within each of the $m$ imputed datasets and measuring how much $q$ varies across them. If a dataset has more cases than another but a similar number of missing values, a smaller proportion of its values is free to vary across the $m$ imputed datasets, so there will be less overall variation in $q$.

Thus, in general, increasing the number of cases (or, more precisely, decreasing the proportion of missing values) will increase imputation quality.
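To see this effect, one could rebuild your example with ten times as many cases but the same three missing cells, and inspect the fraction of missing information (the fmi column that mice's pool() reports). A hypothetical sketch, with enough noise added that mice does not drop the near-collinear predictors:

    library(mice)
    set.seed(1)
    big   <- data.frame(a = 1:60)
    big$b <- 2 * big$a + rnorm(60, sd = 3)
    big$c <- 4 * big$a + rnorm(60, sd = 6)
    big$a[1] <- big$b[2] <- big$c[3] <- NA   # same three missing cells

    imp <- mice(big, m = 5, printFlag = FALSE)
    pool(with(imp, lm(c ~ b)))$pooled$fmi    # compare with the same call on the 6-row data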

The relationship between the number of variables and $B$

Given two otherwise similar datasets, a dataset with a larger number of variables will have a smaller between-imputation variance $B$, as long as those extra variables are informative about the missing values.

This occurs because, in general, missing values for a given variable are "filled in" using information from the other variables to generate plausible estimates (the specific details of how these estimates are generated vary by MI implementation). More information, in the form of extra variables, results in more stable imputed values, and thus less variation in the statistical quantity of interest $q$ across the $m$ imputed datasets.

Thus, in general, increasing the number of variables available in a dataset will increase imputation quality, as long as those extra variables are informative about the missing values.
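Continuing the sketch above, adding an auxiliary variable that is informative about the incomplete columns should tend to reduce the fraction of missing information further (the variable d and its coefficients here are arbitrary):

    # add an informative auxiliary variable and re-impute
    big$d <- 3 * big$a + rnorm(60, sd = 5)
    imp2  <- mice(big, m = 5, printFlag = FALSE)
    pool(with(imp2, lm(c ~ b)))$pooled$fmi   # typically lower than before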

References

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.

Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.