Solved – perform Random Forest AFTER multiple imputation with MICE

Tags: mice, missing-data, multiple-imputation, random-forest, regression

I want to build a prediction model.
Since my data contain missing values, I imputed them with the MICE algorithm.
After that I want to fit a regression with Random Forest.

Now I'm kinda stuck because:

I chose Multiple Imputation with MICE because I want my model to account for the uncertainty that the missing values introduce.
So I generated 5 imputed data sets with MICE.

If I were fitting a GLM, I would build 5 models (one for each imputed data set) and then pool them, so that in the end I have one model whose parameter variances reflect the imputation uncertainty.
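For reference, this is the pooling workflow I mean, using mice's with() and pool() (a minimal sketch; y, x1 and x2 stand in for my actual variables):

library(mice)

# miceResult is the mids object returned by mice()
fits = with(miceResult, glm(y ~ x1 + x2))  # one GLM per imputed data set
pooled = pool(fits)                        # Rubin's rules: pooled estimates, wider variances
summary(pooled)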

What I want to build now, though, is a random forest, and I just can't find any strategies for this. Since RF doesn't produce parameter estimates, I can't pool the models the same way…

Has anyone worked on this before, or does anyone have advice on what I should do?

Best wishes, and thank you in advance!
I really appreciate any help and answers.

Ching

Best Answer

This is not a direct answer to your question, and I don't have enough reputation to comment, but one thing you can do is use the mlr (Machine Learning in R) package. It provides several random forest learners that can handle data with missing values, and you can tune the learners to fit your dataset.

Links to the package and documentation are on the main tutorial page, here:

https://mlr-org.github.io/mlr-tutorial/release/html/index.html
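For instance, a minimal sketch with mlr (df and "outcome" are placeholders for your data frame and target; listLearners() will show what is actually available in your installation):

library(mlr)

# df: your data frame (with NAs); "outcome": your target column
task = makeRegrTask(data = df, target = "outcome")

# discover regression learners that accept missing values natively
listLearners(task, properties = "missings")

# one such learner: randomForestSRC, whose rfsrc() backend can impute missing values internally
lrn = makeLearner("regr.randomForestSRC")
mod = train(lrn, task)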

Also, note that your question becomes much easier to answer if you provide a sample of your dataset.

If you need a direct answer, looping a series of RF calls over the imputed data sets might work. E.g. if you have five imputations (miceResult, test and outcome below are placeholders for your mids object, test set and response variable):

library(mice)
library(party)

res = data.frame(matrix(0, nrow = nrow(test), ncol = 5))
for (i in 1:5) {
  imp = complete(miceResult, i)              # i-th completed data set
  rf.res = cforest(outcome ~ ., data = imp)  # placeholder formula; substitute your own
  res[, i] = predict(rf.res, newdata = test)
}

Then you can pool the predictions, by majority voting for classification or by averaging for regression. You could also stack the 5 imputed data sets into one combined data set and train a single learner on that. Both methods are suboptimal, however.
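For example, with the res data frame from the loop above (averaging for a regression task; per-row majority vote if your predictions are classes):

# regression: average the five per-imputation predictions
pred.regr = rowMeans(res)

# classification: take the most frequent predicted class in each row
pred.class = apply(res, 1, function(p) names(which.max(table(p))))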

Hope this helps.