Consider the following R code:
> data <- data.frame(
a=c(NA,2,3,4,5,6),b=c(2.2,NA,6.1,8.3,10.2,12.13),c=c(4.2,7.9,NA,16.1,19.9,23))
> data
a b c
1 NA 2.20 4.2
2 2 NA 7.9
3 3 6.10 NA
4 4 8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0
As you can see, I've engineered the data so that roughly c = 2*b = 4*a. As such, I would expect the missing values to be around a=1, b=4, c=12. So I performed the analysis:
> imp <- mi(data)
Beginning Multiple Imputation ( Sat Oct 18 03:02:41 2014 ):
Iteration 1
Chain 1 : a* b* c*
Chain 2 : a* b* c*
Chain 3 : a* b* c*
Iteration 2
Chain 1 : a* b c
Chain 2 : a* b* c*
Chain 3 : a b* c
Iteration 3
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a* b* c*
Iteration 4
Chain 1 : a b c
Chain 2 : a b* c
Chain 3 : a* b c
Iteration 5
Chain 1 : a b c*
Chain 2 : a b* c
Chain 3 : a b* c
Iteration 6
Chain 1 : a* b c*
Chain 2 : a b c
Chain 3 : a b c
Iteration 7
Chain 1 : a b c
Chain 2 : a b* c
Chain 3 : a b c*
Iteration 8
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b* c*
Iteration 9
Chain 1 : a b c
Chain 2 : a b c*
Chain 3 : a b c
Iteration 10
Chain 1 : a b* c
Chain 2 : a b c
Chain 3 : a b c
Iteration 11
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 12
Chain 1 : a b c
Chain 2 : a* b c
Chain 3 : a b c
Iteration 13
Chain 1 : a b c
Chain 2 : a b c*
Chain 3 : a b c*
Iteration 14
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 15
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c*
Iteration 16
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b* c
Iteration 17
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 18
Chain 1 : a b c*
Chain 2 : a b c
Chain 3 : a b c
Iteration 19
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c*
Iteration 20
Chain 1 : a b c*
Chain 2 : a b c
Chain 3 : a b c
Iteration 21
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 22
Chain 1 : a b c*
Chain 2 : a b c
Chain 3 : a b c
Iteration 23
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 24
Chain 1 : a b c*
Chain 2 : a b c
Chain 3 : a b c
Iteration 25
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 26
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 27
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 28
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 29
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
mi converged ( Sat Oct 18 03:02:45 2014 )
Run 20 more iterations to mitigate the influence of the noise...
Beginning Multiple Imputation ( Sat Oct 18 03:02:45 2014 ):
Iteration 1
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
Iteration 2
Chain 1 : a b c
Chain 2 : a b c
Chain 3 : a b c
[... iterations 3-20 identical ...]
Reached the maximum iteration, mi did not converge ( Sat Oct 18 03:02:48 2014 )
And finally observed the completed data set:
> mi.completed(imp)
[[1]]
a b c
1 2 2.20 4.2
2 2 2.20 7.9
3 3 6.10 16.1
4 4 8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0
[[2]]
a b c
1 2 2.20 4.2
2 2 6.10 7.9
3 3 6.10 7.9
4 4 8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0
[[3]]
a b c
1 2 2.20 4.2
2 2 2.20 7.9
3 3 6.10 7.9
4 4 8.30 16.1
5 5 10.20 19.9
6 6 12.13 23.0
As you can see, the imputed values are not what I expected. In fact, they look like the result of single imputation: the missing values appear to have simply been taken from adjacent records.
What am I missing?
I should note that my "knowledge" of statistics is mostly limited to what I vaguely remember from an introductory course I took ~14 years ago. I'm just looking for a simple way to impute missing values; it doesn't have to be the most optimized one, but it does need to make some sort of sense (which I can't make of these results). It may well be the case that mi isn't the correct approach for what I want (perhaps predict should be used instead), so I'm open to suggestions.
I also tried a similar approach with mice, which led to similar results.
UPDATE: Amelia works great out of the box. It would still be interesting to know what I'm missing with mi / mice, though.
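For reference, a minimal sketch of the kind of calls I mean, assuming each package's documented default interface (the choice of m = 3 imputed datasets is arbitrary):

```r
library(Amelia)
library(mice)

# The same toy data as above
data <- data.frame(
  a = c(NA, 2, 3, 4, 5, 6),
  b = c(2.2, NA, 6.1, 8.3, 10.2, 12.13),
  c = c(4.2, 7.9, NA, 16.1, 19.9, 23))

a_out <- amelia(data, m = 3)   # bootstrap-EM multiple imputation
a_out$imputations[[1]]         # first completed dataset

m_out <- mice(data, m = 3, printFlag = FALSE)  # chained-equations imputation
complete(m_out, 1)             # first completed dataset
```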
Best Answer
Given that you are using six cases [records] and three variables, the quality of your imputation will be quite low.
To see why this will be the case, remember that multiple imputation works by filling in missing values with plausible imputed values. These imputed values are calculated in $m$ separate datasets (I will return to how these imputed values are derived later in this answer). The imputed values will vary slightly from dataset to dataset.
Thus, given a statistical quantity of interest $q$ (e.g., a mean, a regression coefficient, etc.), one can use the $m$ datasets to estimate the average standard error for $q$ within the $m$ datasets (a quantity that I will call the within-imputation variance, or $\bar{U}$), and the degree to which $q$ varies across the $m$ datasets (a quantity that I will call the between-imputation variance, or $B$).
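To make these two quantities concrete, here is a minimal sketch of how $\bar{U}$ and $B$ are computed under Rubin's rules. The estimates and standard errors below are made up purely for illustration:

```r
# Pooling an estimate of q (say, a regression coefficient) across m = 3
# imputed datasets. The numbers below are invented for illustration only.
q_hat <- c(3.45, 3.62, 3.50)   # estimate of q in each imputed dataset
se    <- c(0.71, 0.74, 0.70)   # standard error of q in each imputed dataset
m     <- length(q_hat)

U_bar <- mean(se^2)            # within-imputation variance
B     <- var(q_hat)            # between-imputation variance
Total <- U_bar + (1 + 1/m) * B # total variance of the pooled estimate
```

The last line is the standard Rubin combination: it is the $(1 + \frac1m)B$ term that grows when the imputed datasets disagree with one another.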
The relationship between imputation quality, $B$, and $\bar{U}$
One can use the within-imputation variance $\bar{U}$ and the between-imputation variance $B$ to derive an estimate of the degree to which an imputed estimate of a statistical quantity has been influenced by missing information. Naturally, the more information that has been lost, the poorer the quality of the imputation. The estimated fraction of information lost to missingness is labeled $\gamma$ and is given by the following formula:
$$\gamma = \frac{r + \frac{2}{df + 3}}{r + 1}$$
$r$ in this formula is a ratio of the between-imputation variance $B$ to the within-imputation-variance $\bar{U}$:
$$r = \frac{(1 + \frac1m)B}{\bar{U}}$$
Thus, high values of $B$ result in high values of $r$, which in turn will result in high values of $\gamma$. A high value of $\gamma$, in turn, indicates more information lost due to missing data and a poorer quality imputation.
$df$ in the formula for $\gamma$ is also a function of $B$ and $\bar{U}$. Specifically, $df$ is estimated by
$$df = (m - 1)\left(1 + \frac{m\bar{U}}{(m + 1)B}\right)^2$$
Thus, in addition to increasing the ratio of between-imputation variance to within-imputation variance, increasing $B$ also decreases $df$. This will result in a higher value of $\gamma$, indicating more information lost to missingness and a poorer quality imputation.
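Plugging arbitrary illustrative values for $\bar{U}$ and $B$ into the formulas above shows how the pieces interact (a sketch only; the numbers carry no meaning beyond illustration):

```r
# Arbitrary illustrative values for the within- and between-imputation variance
m     <- 3
U_bar <- 0.514
B     <- 0.0076

r     <- (1 + 1/m) * B / U_bar                        # variance ratio
df    <- (m - 1) * (1 + m * U_bar / ((m + 1) * B))^2  # degrees of freedom
gamma <- (r + 2 / (df + 3)) / (r + 1)                 # info lost to missingness

# Doubling B raises r and shrinks df, so gamma rises on both counts
B2     <- 2 * B
r2     <- (1 + 1/m) * B2 / U_bar
df2    <- (m - 1) * (1 + m * U_bar / ((m + 1) * B2))^2
gamma2 <- (r2 + 2 / (df2 + 3)) / (r2 + 1)
```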
In sum, higher values of the between-imputation variance $B$ reduce imputation quality in two ways:
by increasing the variance ratio $r$, which directly increases $\gamma$; and
by decreasing $df$, which further increases $\gamma$.
The relationship between the number of cases and $B$
Given two otherwise similar datasets, a dataset with a smaller number of cases will have a larger between-imputation variance $B$.
This will occur because, as I describe above, the between-imputation variance is computed by estimating a statistical quantity of interest $q$ within each of the $m$ imputed datasets and measuring the degree to which $q$ varies across them. If one dataset has more cases than another but a similar number of missing values, a smaller proportion of its values will be free to vary across the $m$ imputed datasets, so there will be less overall variation in $q$ across those datasets.
Thus, in general, increasing the number of cases (or, more precisely, decreasing the proportion of missing values) will increase imputation quality.
The relationship between the number of variables and $B$
Given two otherwise similar datasets, a dataset with a larger number of variables will have a smaller between-imputation variance $B$, as long as those extra variables are informative about the missing values.
This will occur because, in general, missing values for a given variable are "filled in" by using information from other variables to generate plausible estimates of the missing values (the specific details of how these estimates are generated will vary depending on the MI implementation you're using). More information in the form of extra variables will result in more stable imputed values, resulting in less variation in the statistical quantity of interest $q$ across the $m$ imputed datasets.
Thus, in general, increasing the number of variables available in a dataset will increase imputation quality, as long as those extra variables are informative about the missing values.
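As a rough way to probe this empirically, one could compare the fraction of missing information reported by mice with and without an informative auxiliary variable (a hedged sketch; the simulated variables and their names are illustrative, not from the question):

```r
library(mice)
set.seed(1)

n   <- 100
x   <- rnorm(n)
aux <- x + rnorm(n, sd = 0.2)   # auxiliary variable strongly related to x
y   <- 2 * x + rnorm(n)
x[sample(n, 20)] <- NA          # introduce missingness in x

# Impute with and without the informative auxiliary variable
imp_with    <- mice(data.frame(x, y, aux), m = 5, printFlag = FALSE)
imp_without <- mice(data.frame(x, y),      m = 5, printFlag = FALSE)

# Pool a regression of y on x in each case; the fmi column of the pooled
# summary should tend to be lower when aux is available to the imputer
pool_with    <- pool(with(imp_with,    lm(y ~ x)))
pool_without <- pool(with(imp_without, lm(y ~ x)))
summary(pool_with)
summary(pool_without)
```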
References
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.