Solved – Prediction with randomForest (R) when some inputs have missing values (NA)

missing data, prediction, r, random forest

I have a good randomForest classification model that I would like to use in an application to predict the class of a new case. The new case will inevitably have missing values, and predict() does not handle NAs as is. How should I approach this?

library(randomForest)
data(iris)
# create first the new case with missing values
na.row<-45
na.col<-c(3,5)
case.na<-iris[na.row,]
case.na[,na.col]<-NA

iris.rf <- randomForest(Species ~ ., data=iris[-na.row,])
# print(iris.rf)

myrf.pred <- predict(iris.rf, case.na[-5], type="response")
myrf.pred
[1] <NA>

I tried missForest. I combined the original data and the new case, ran the combined data through missForest, and got imputed values for the NAs in my new case. It is too computationally heavy, though.

data.imp <- missForest(data.with.na)

But there must be a way to use rf-model to predict a new case with missing values, right?

Best Answer

You have no choice but to impute the values or to change models. A good choice could be aregImpute in the Hmisc package. I think it is less heavy than random-forest-based imputation (rfImpute, missForest), which is what is holding you up. Here is the first example from the package documentation (there are others):

library(Hmisc)
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)

x1[1:m]^2
a$imputed$x2

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
                data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f

# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
 tab <- table(u)
 as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, 
                       data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc)   # SEs are larger than from mult. imputation

You mention that you have many new observations with missing values on the independent variables. Even with many such cases, if each new observation is missing only one or two of its variables and the total number of variables is not tiny, simply filling the gaps with a median or mean (are the variables continuous?) could work.
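The median fill can be sketched in a few lines of base R. This is only an illustration under some assumptions: `impute_from_train` is a hypothetical helper (not part of any package), the medians are taken from the training data rather than from the new case itself, and factor columns fall back to the most frequent training level. The prediction step at the end runs only if randomForest is installed.

```r
data(iris)
na.row <- 45
train <- iris[-na.row, ]

# New case without the response, with one missing predictor
new.case <- iris[na.row, -5]
new.case[, "Petal.Length"] <- NA

# Hypothetical helper: fill NAs in newdata using statistics from train
impute_from_train <- function(newdata, train) {
  for (v in names(newdata)) {
    miss <- is.na(newdata[[v]])
    if (!any(miss)) next
    if (is.numeric(train[[v]])) {
      # numeric column: use the training median
      newdata[[v]][miss] <- median(train[[v]], na.rm = TRUE)
    } else {
      # factor/character column: use the most frequent training level
      newdata[[v]][miss] <- names(which.max(table(train[[v]])))
    }
  }
  newdata
}

new.case.imp <- impute_from_train(new.case, train)
new.case.imp

# Predict with the imputed case (only if randomForest is available)
if (requireNamespace("randomForest", quietly = TRUE)) {
  iris.rf <- randomForest::randomForest(Species ~ ., data = train)
  print(predict(iris.rf, new.case.imp, type = "response"))
}
```

Taking the fill values from the training set matters here: with a single new row there is nothing in the new case itself to compute a median from.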

Another thing that could be interesting is to do a minor variable importance analysis. The random forest R implementation calculates two importance measures and respective plots:

varImpPlot(yourRandomForestModel) # fit yourRandomForestModel with importance=TRUE to get both measures; otherwise only the Gini-based one is shown

You can then experiment with training the model on only the "important" variables, as long as the prediction accuracy is not affected much compared to the "full model". Ideally you keep variables that have few missing values. This could help you reduce the size of your problem.
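As a rough sketch of that comparison, assuming the iris data from the question and an arbitrary choice of keeping the two top-ranked predictors, one could compare the out-of-bag error of the full and the reduced model. The cutoff of two variables and the helper `oob` are illustrative choices, not a recommendation.

```r
library(randomForest)

data(iris)
set.seed(1)

# Full model; importance=TRUE adds the permutation-based measure
rf.full <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1 selects the mean decrease in accuracy
imp <- importance(rf.full, type = 1)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
ranked

# Refit using only the two top-ranked predictors (arbitrary cutoff)
keep <- ranked[1:2]
rf.small <- randomForest(reformulate(keep, response = "Species"), data = iris)

# Out-of-bag error rate after the last tree, for each model
oob <- function(rf) unname(rf$err.rate[rf$ntree, "OOB"])
c(full = oob(rf.full), reduced = oob(rf.small))
```

If the reduced model's error stays close to the full model's, new cases only need values for the retained variables, which leaves fewer gaps to impute.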