Solved – perform Random Forest AFTER multiple imputation with MICE

Tags: mice, missing-data, multiple-imputation, random-forest, regression

I want to build a prediction model.
Since my data contain missing values, I imputed them with the MICE algorithm.
After that I want to fit a regression with Random Forest.

Now I'm kinda stuck because:

I chose Multiple Imputation with MICE because I want my model to account for the uncertainty that the missing values introduce.
So I generated 5 imputed data sets with MICE.

If I were fitting a GLM, I would build 5 models (one for each imputed data set) and then pool them, so that in the end I have one model whose parameter variances reflect the imputation uncertainty.
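For reference, this is the pooling workflow I mean, using mice's with() and pool() (a minimal sketch; y, x1 and x2 stand in for my actual variables):

library(mice)

# miceResult is the mids object returned by mice()
fits = with(miceResult, glm(y ~ x1 + x2))  # one GLM per imputed data set
pooled = pool(fits)                        # Rubin's rules: pooled estimates, wider variances
summary(pooled)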

What I want to build now, though, is a random forest, and I just can't find any strategies for this. Since RF doesn't produce parameter estimates, I can't pool the models the same way…

Has anyone worked on this before, or does anyone have advice on what I should do?

Best wishes, and thank you in advance!
I really appreciate any help and answers.

Ching

Best Answer

This is not a direct answer to your question, and I don't have enough reputation to comment, but one thing you can do is use the mlr (Machine Learning in R) package. It provides several random forest learners that can handle data with missing values, and you can tune the learners to fit your dataset.

Links to the package and documentation are on the main tutorial page, here:

https://mlr-org.github.io/mlr-tutorial/release/html/index.html
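For instance, a minimal sketch with mlr (df and "outcome" are placeholders for your data frame and target; listLearners() will show what is actually available in your installation):

library(mlr)

# df: your data frame (with NAs); "outcome": your target column
task = makeRegrTask(data = df, target = "outcome")

# discover regression learners that accept missing values natively
listLearners(task, properties = "missings")

# one such learner: randomForestSRC, whose rfsrc() backend can impute missing values internally
lrn = makeLearner("regr.randomForestSRC")
mod = train(lrn, task)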

Also, note that your question becomes much easier to answer if you provide a sample of your dataset.

If you need a direct answer, looping a series of RF calls over the imputed data sets might work. E.g. if you have five imputations (miceResult, test and outcome below are placeholders for your mids object, test set and response variable):

library(mice)
library(party)

res = data.frame(matrix(0, nrow = nrow(test), ncol = 5))
for (i in 1:5) {
  imp = complete(miceResult, i)              # i-th completed data set
  rf.res = cforest(outcome ~ ., data = imp)  # placeholder formula; substitute your own
  res[, i] = predict(rf.res, newdata = test)
}

Then you can pool the predictions, by majority voting for classification or by averaging for regression. You could also stack the 5 imputed data sets into one combined data set and train a single learner on that. Both methods are suboptimal, however.
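For example, with the res data frame from the loop above (averaging for a regression task; per-row majority vote if your predictions are classes):

# regression: average the five per-imputation predictions
pred.regr = rowMeans(res)

# classification: take the most frequent predicted class in each row
pred.class = apply(res, 1, function(p) names(which.max(table(p))))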

Hope this helps.