missForest data imputation vs. MICE using RF as imputation method

data-imputation, mice, random-forest

Is the missForest package a special case of MICE using Random Forest as imputation (for just a single imputation)?

The missForest algorithm is described here: https://academic.oup.com/bioinformatics/article/28/1/112/219101 (chapter 2)
The MICE algorithm here: https://stefvanbuuren.name/fimd/sec-FCS.html (chapter 4.5.2)

To me both approaches look pretty similar (or even the same?).
missForest first sorts the variables to be imputed by their number of missing values, whereas (as far as I know) MICE uses the order in which the variables are supplied to the function, but other than that I can't see any differences.

It would be great if somebody could confirm or disprove my thoughts (and, if they do differ, point out the differences between the two approaches).

This is just a question out of interest; I don't have a particular use case for data imputation in mind yet.

Best Answer

I think one of the differences is that missForest is, at least in its original form, a method for single imputation, i.e. it produces a single "best" completed dataset. It tries to find the "best" predictions for the missing values given some set of predictors that it identifies (e.g. using internal variable selection). It therefore does not account for the inherent variability due to missingness, leading to overconfidence or bias in whatever analysis you do using the imputed dataset (see here and here for a discussion, and this paper for bias due to missForest).

mice, on the other hand, even under a random forest-based implementation, tries to produce imputations that have a random component to them, instead of the "best" prediction (see here for a discussion). See this paper for how mice does random forest-based imputation. Essentially, it runs multiple random forest imputation models on bootstrapped samples of the observed data and randomly selects the predicted value from one of the models.
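To make the practical difference concrete, here is a minimal R sketch. It assumes the missForest and mice packages and uses the built-in iris data with values removed at random; the specific settings (noNA, m, the toy analysis model) are illustrative only, not recommendations.

```r
library(missForest)
library(mice)

set.seed(123)

# Toy data: knock out ~10% of values completely at random
data(iris)
iris_mis <- prodNA(iris, noNA = 0.10)   # prodNA() ships with missForest

## missForest: a single "best" completed dataset
mf_out <- missForest(iris_mis)
head(mf_out$ximp)        # the one imputed data frame
mf_out$OOBerror          # out-of-bag estimate of the imputation error

## mice with random forests: m stochastic completions
mice_out <- mice(iris_mis, method = "rf", m = 5, printFlag = FALSE)

# Each completed dataset differs, reflecting imputation uncertainty
imp1 <- complete(mice_out, 1)
imp2 <- complete(mice_out, 2)

# Usual workflow: fit the analysis model on every completed dataset
# and combine the results with Rubin's rules
fits   <- with(mice_out, lm(Sepal.Length ~ Sepal.Width + Petal.Length))
pooled <- pool(fits)
summary(pooled)
```

The m > 1 completions and the pool() step are exactly where the between-imputation variance enters the analysis, which is the uncertainty that a single missForest-completed dataset cannot convey.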
