Solved – How to impute a missing categorical predictor variable for a random forest model

missing-data, r, random-forest

I have a set of x, y data I'm using to build a random forest. The x data is a vector of values that includes some NAs. So I use rfImpute to handle the missing data and grow a random forest. Now I have a new, unseen observation x (with an NA) and I want to predict y. How do I impute the missing value so that I can use the random forest I have already grown? The rfImpute function seems to require both x and y, but at prediction time I only have x.
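To make the setup concrete, here is a minimal sketch of the situation, assuming x is a data frame of predictors containing NAs and y is the known training response; the object names and the particular cell set to NA are placeholders, not code from the question.

library(randomForest)

# training: rfImpute uses both x and y to fill in the NAs,
# then a forest is grown on the completed predictors
imputed = rfImpute(x, y)              # first column of the result is the response
rf = randomForest(imputed[, -1], y)

# prediction: a new observation with a missing predictor and no known y,
# so rfImpute cannot be applied here -- this is the problem
x.new = x[1, ]
x.new[1, 2] = NA
# predict(rf, x.new)                  # errors until the NA is somehow imputed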

My question is similar (but different) to this question. As an example, I can use the same iris dataset. If I've correctly interpreted the code in the answer to the question I reference, the code iris.na[148, , drop=FALSE] in the statement iris.na2 = rbind(iris.imputed, iris.na[148, , drop=FALSE]) represents the new data, which includes the Species (the Y value). In my problem I would not know the Species; that is what I want the random forest to predict. I would have the 4 independent variables, but some might be NA for a given row. To continue the analogy, imagine I have 3 of the 4 variables and one is missing. I want to impute that value, and then predict the species, which I do not know.

In response to gung's comment that I should add an illustration, let me put it in terms of the iris data set. Imagine I have the following data on a flower: I know its Sepal.Length, Sepal.Width, and Petal.Length, but not its Petal.Width. I'd like to impute the Petal.Width and then use those 4 values within an RF model to predict the Species.
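For example, such an observation might look like this (the numeric values are made up purely for illustration; they are not from the original question):

# hypothetical new flower: three measurements known, Petal.Width missing
new.flower = data.frame(Sepal.Length = 5.8,
                        Sepal.Width  = 2.7,
                        Petal.Length = 4.1,
                        Petal.Width  = NA)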

Best Answer

I think you need an unsupervised imputation method, that is, one which does not use the target values for imputation. If you only have a few prediction feature vectors, it may be difficult to uncover the data structure from them alone. Instead, you can mix the new feature vectors with the already-imputed training feature vectors and use that structure to impute once again. Note that this procedure may violate independence assumptions, so wrap the entire procedure in an outer cross-validation to check for serious overfitting.

I just learned about missForest from a comment on this question, and it seems to do the trick. I simulated your problem on the iris data (without the outer cross-validation).

rm(list=ls())
data("iris")
set.seed(1234)
n.train = 100
train.index = sample(nrow(iris),n.train)
feature.train = as.matrix(iris[ train.index,1:4])
feature.test  = as.matrix(iris[-train.index,1:4])


#simulate 40 NAs in train
n.NAs = 40
NA.index = sample(length(feature.train),n.NAs)
NA.feature.train = feature.train
NA.feature.train[NA.index] = NA

#imputing 40 NAs unsupervised
library(missForest)
imp.feature.train = missForest(NA.feature.train)$ximp
#check how well the imputation went; looks promising for this data set
plot(feature.train[NA.index], imp.feature.train[NA.index],
     xlab = "true value", ylab = "imputed value")

#simulate random NAs in feature test
feature.test[sample(length(feature.test),20)] = NA

#mix feature.test with imp.feature.train
nrow.test = nrow(feature.test)
mix.feature = rbind(feature.test,imp.feature.train)
imp.feature.test = missForest(mix.feature)$ximp[1:nrow.test,]

#train RF and predict
library(randomForest)
rf = randomForest(imp.feature.train,iris$Species[train.index])
pred.test = predict(rf,imp.feature.test)
table(pred.test, iris$Species[-train.index])

The resulting confusion matrix for the 50 test observations (predicted classes in rows, true classes in columns):
pred.test    setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         20         2
  virginica       0          1        15
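Only 3 of the 50 held-out flowers are misclassified, so the unsupervised imputation does not hurt much on this data. The outer cross-validation mentioned above is not part of this example; a rough sketch of how the whole impute-mix-predict pipeline could be wrapped in one follows, where the fold count, the number of simulated NAs, and the loop structure are my own choices rather than code from the answer.

library(missForest)
library(randomForest)

# introduce some random NAs in the predictors, as in the simulation above
set.seed(1234)
iris.na = as.matrix(iris[, 1:4])
iris.na[sample(length(iris.na), 60)] = NA

# 5-fold outer cross-validation around the impute-then-predict pipeline
folds = sample(rep(1:5, length.out = nrow(iris)))
acc = numeric(5)
for (k in 1:5) {
  train = iris.na[folds != k, ]
  test  = iris.na[folds == k, ]

  # impute training predictors without using the response
  train.imp = missForest(train)$ximp

  # impute test predictors by mixing them with the already-imputed training rows
  mix = rbind(test, train.imp)
  test.imp = missForest(mix)$ximp[seq_len(nrow(test)), , drop = FALSE]

  # fit on imputed training features, evaluate on imputed test features
  rf = randomForest(train.imp, iris$Species[folds != k])
  acc[k] = mean(predict(rf, test.imp) == iris$Species[folds == k])
}
mean(acc)   # cross-validated accuracy of the full pipeline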