I have a well-performing randomForest
classification model which I would like to use in an application that predicts the class of a new case. The new case inevitably has missing values, and predict() as such won't work with NAs. How should I handle this?
library(randomForest)
data(iris)
# create first the new case with missing values
na.row<-45
na.col<-c(3,5)
case.na<-iris[na.row,]
case.na[,na.col]<-NA
iris.rf <- randomForest(Species ~ ., data=iris[-na.row,])
# print(iris.rf)
myrf.pred <- predict(iris.rf, case.na[-5], type="response")
myrf.pred
[1] <NA>
I tried missForest
. I combined the original data with the new case, ran it through missForest
, and got imputed values for the NAs in my new case. The computation was too heavy, though.
data.imp <- missForest(data.with.na)$ximp  # the imputed data frame is in the $ximp element
But surely there must be a way to use the rf model to predict a new case that has missing values?
Best Answer
You have no choice but to impute the values or to change models. A good choice could be aregImpute in the Hmisc package. I think it's less computationally heavy than rfImpute, which is what is slowing you down.
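A rough sketch of that approach, reusing `na.row` and `case.na` from the question (treat this as an outline, not a drop-in solution; aregImpute details may need tuning for your data):

```r
library(Hmisc)
library(randomForest)

# Fit the forest on the complete rows, as in the question
iris.rf <- randomForest(Species ~ ., data = iris[-na.row, ])

# Pool the training data with the incomplete new case, then impute the
# continuous predictors with aregImpute
combined <- rbind(iris[-na.row, ], case.na)
imp <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = combined, n.impute = 5)

# Extract one completed data set and predict the new (last) row
filled <- impute.transcan(imp, imputation = 1, data = combined,
                          list.out = TRUE, pr = FALSE, check = FALSE)
completed <- combined
completed[names(filled)] <- filled
predict(iris.rf, completed[nrow(completed), -5])
```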
You mention that you have many new observations with missing values on the independent variables. Even though you have many such cases, if each new observation has missings in only one or two of its variables, and your number of variables is not tiny, simply filling the holes with the median or the mean (are they continuous?) could work.
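For example, a minimal sketch of median filling against the training data (reusing `iris`, `na.row`, and `case.na` from the question):

```r
train <- iris[-na.row, ]
case.filled <- case.na
for (j in names(case.filled)) {
  # replace each missing numeric value with the training-set median
  if (is.na(case.filled[[j]]) && is.numeric(train[[j]])) {
    case.filled[[j]] <- median(train[[j]], na.rm = TRUE)
  }
}
# case.filled now has the training median in Petal.Length and can be
# passed to predict() on the predictor columns
```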
Another thing that could be interesting is a quick variable-importance analysis. The randomForest R implementation calculates two importance measures (mean decrease in accuracy and mean decrease in Gini), with corresponding plots.
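For instance, refitting with importance = TRUE so that the permutation-based measure is also computed (names as in the question):

```r
iris.rf <- randomForest(Species ~ ., data = iris[-na.row, ], importance = TRUE)
importance(iris.rf)   # MeanDecreaseAccuracy and MeanDecreaseGini per variable
varImpPlot(iris.rf)   # the two corresponding dot plots
```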
And you can play around with including only the "important" variables in the model training, until the prediction accuracy isn't much affected in comparison to the "full model". Maybe you keep the variables with a low number of missings. It could help you reduce the size of your problem.
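A sketch of that pruning step, keeping only the top two variables by Gini importance (assumes `iris.rf` from the question; the Gini measure is available even without importance = TRUE):

```r
imp <- importance(iris.rf)
top <- names(sort(imp[, "MeanDecreaseGini"], decreasing = TRUE))[1:2]

# Refit on the reduced variable set
rf.small <- randomForest(Species ~ ., data = iris[-na.row, c(top, "Species")])

# Compare the out-of-bag error rates of the reduced and full models
rf.small$err.rate[rf.small$ntree, "OOB"]
iris.rf$err.rate[iris.rf$ntree, "OOB"]
```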