Solved – Prediction with randomForest (R) when some inputs have missing values (NA)

Tags: missing data, prediction, r, random forest

I have a well-performing randomForest classification model that I would like to use in an application that predicts the class of a new case. The new case will inevitably have missing values, and predict() returns NA when a predictor is NA. How should I handle this?

library(randomForest)
data(iris)
# create first the new case with missing values
na.row<-45
na.col<-c(3,5)
case.na<-iris[na.row,]
case.na[,na.col]<-NA

iris.rf <- randomForest(Species ~ ., data=iris[-na.row,])
# print(iris.rf)

myrf.pred <- predict(iris.rf, case.na[-5], type="response")
myrf.pred
[1] <NA>

I tried missForest: I combined the original data and the new case, ran the result through missForest, and got imputed values for the NAs in my new case. It is computationally too heavy, though.

data.imp <- missForest(data.with.na)

But there must be a way to use rf-model to predict a new case with missing values, right?

Best Answer

You have no choice but to impute the values or to change models. A good choice could be aregImpute in the Hmisc package. I think it's less heavy than rfImpute, which is what is holding you up. Here is the first example from the package documentation (there are others):

library(Hmisc)
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)

x1[1:m]^2
a$imputed$x2

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
                data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f

# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
 tab <- table(u)
 as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, 
                       data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc)   # SEs are larger than from mult. imputation
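
On the "change models" option: one candidate is a single classification tree from the rpart package, which uses surrogate splits to route cases with missing predictors instead of refusing to predict. The following is only a minimal sketch on the iris case from the question. rpart is my choice for illustration, not something the question used, and a single tree will usually be less accurate than a forest.

```r
library(rpart)

data(iris)

# Recreate the new case with missing values from the question
na.row <- 45
case.na <- iris[na.row, ]
case.na[, c(3, 5)] <- NA

# Fit a single classification tree on the remaining rows
tree.fit <- rpart(Species ~ ., data = iris[-na.row, ])

# predict() falls back on surrogate splits where Petal.Length is NA
tree.pred <- predict(tree.fit, case.na[-5], type = "class")
tree.pred
```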

You mention that you have many new observations with missing values on the independent variables. Even so, if each new observation is missing only one or two of its variables and the number of variables is not tiny, simply filling the holes with a median or mean (are they continuous?) could work.
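
A minimal sketch of that median fill, assuming all predictors are continuous as in iris. The helper fill.with.median is a hypothetical name introduced here for illustration; the medians are taken from the training data, since a single new case cannot supply its own.

```r
library(randomForest)

data(iris)
set.seed(42)

na.row <- 45
train <- iris[-na.row, ]
new.case <- iris[na.row, -5]
new.case[, 3] <- NA  # Petal.Length is missing in the new case

# Medians of the numeric predictors, computed on the training data only
train.medians <- sapply(train[, 1:4], median, na.rm = TRUE)

# Hypothetical helper: replace each NA with the training median of its column
fill.with.median <- function(newdata, medians) {
  for (v in names(medians)) {
    miss <- is.na(newdata[[v]])
    newdata[[v]][miss] <- medians[[v]]
  }
  newdata
}

new.case.filled <- fill.with.median(new.case, train.medians)

iris.rf <- randomForest(Species ~ ., data = train)
pred <- predict(iris.rf, new.case.filled, type = "response")
pred
```

For categorical predictors the analogous fill would be the most frequent level rather than the median.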

Another thing that could be interesting is a quick variable importance analysis. The randomForest implementation in R calculates two importance measures and plots both:

varImpPlot(yourRandomForestModel) # fit yourRandomForestModel with importance=TRUE to get both measures

You can then experiment with training on only the "important" variables, as long as prediction accuracy is not much worse than with the full model. You might also prefer to keep variables with few missing values. This could help reduce the size of your problem.
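
As a rough sketch of that comparison on iris: fit the full model, keep the top-ranked predictors, and compare out-of-bag (OOB) error. Keeping exactly two variables is an arbitrary choice made here for illustration.

```r
library(randomForest)

data(iris)
set.seed(1)

# Full model; importance=TRUE also computes the permutation-based measure
rf.full <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1 selects mean decrease in accuracy
imp <- importance(rf.full, type = 1)
top.vars <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:2]

# Reduced model on the two highest-ranked predictors only
rf.small <- randomForest(x = iris[, top.vars], y = iris$Species)

# Compare OOB error rates after the final tree
oob <- c(full    = unname(rf.full$err.rate[rf.full$ntree, "OOB"]),
         reduced = unname(rf.small$err.rate[rf.small$ntree, "OOB"]))
oob
```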

Related Solutions

Solved – How to impute a missing categorical predictor variable for a random forest model

I think you need an unsupervised imputation method, that is, one that does not use the target values for imputation. If you only have a few feature vectors to predict on, it may be difficult to uncover any structure in the data. Instead, you could mix your prediction feature vectors with the already imputed training feature vectors and use this combined structure to impute once again. Note that this procedure may violate assumptions of independence, so wrap the entire procedure in an outer cross-validation to check for serious overfitting.

I just learned about missForest from a comment on this question, and it seems to do the trick. I simulated your problem on the iris data (without the outer cross-validation).

rm(list=ls())
data("iris")
set.seed(1234)
n.train = 100
train.index = sample(nrow(iris),n.train)
feature.train = as.matrix(iris[ train.index,1:4])
feature.test  = as.matrix(iris[-train.index,1:4])


#simulate 40 NAs in train
n.NAs = 40
NA.index = sample(length(feature.train),n.NAs)
NA.feature.train = feature.train; NA.feature.train[NA.index] = NA

#imputing 40 NAs unsupervised
library(missForest)
imp.feature.train = missForest(NA.feature.train)$ximp
#check how well imputation went, seems promising for this data set
plot(feature.train[NA.index], imp.feature.train[NA.index],
     xlab="true value", ylab="imp value")

#simulate random NAs in feature test
feature.test[sample(length(feature.test),20)] = NA

#mix feature.test with imp.feature.train
nrow.test = nrow(feature.test)
mix.feature = rbind(feature.test,imp.feature.train)
imp.feature.test = missForest(mix.feature)$ximp[1:nrow.test,]

#train RF and predict
library(randomForest)
rf = randomForest(imp.feature.train,iris$Species[train.index])
pred.test = predict(rf,imp.feature.test)
table(pred.test, iris$Species[-train.index])

Printing...
-----------------
pred.test    setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         20         2
  virginica       0          1        15

Solved – the proper way to use rfImpute? (Imputation by Random Forest in R)

I'm not entirely sure if this is an answer to your question, but maybe you'll find it useful.

Maybe the author of the randomForest package would disagree with me, but I feel that rfImpute() is mostly called by other imputation packages inside their own algorithms to impute many variables. If you have only one variable with missing data, using this function standalone may work. However, I think most people have many variables with missing data in a dataset that they'd like to impute. Enter the packages missForest and mice.

If you use the R package missForest, you can impute your entire dataset (many variables of different types may be missing) with one command, missForest(). If I recall correctly, this function draws on the rfImpute() function from the randomForest package. For some reason (maybe others can elaborate), when you use missForest(), the other variables used to predict a single variable can also have missingness. So I think this function and package are a good idea if you want to end up with just one dataset after all variables have been imputed.

The downside to using missForest() is that you only get one dataset, which does not allow you to take into account the uncertainty of your estimates (in your follow-on analytical models). So your analytical models will have incorrect confidence intervals if you just base the analysis on that one imputed dataset. If that doesn't matter to you, then I highly recommend this package and function, because it is very easy to use and specify your imputation model.
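
For reference, a minimal sketch of that single-dataset workflow on iris. The 10% missingness rate is an arbitrary choice for illustration; prodNA() is the missForest helper that removes values at random.

```r
library(missForest)

data(iris)
set.seed(81)

# Remove 10% of the entries completely at random
iris.mis <- prodNA(iris, noNA = 0.1)

# Impute all variables (numeric and factor) in one call
imp <- missForest(iris.mis)

head(imp$ximp)  # the single completed dataset
imp$OOBerror    # out-of-bag estimate of the imputation error
```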

However, if you do need appropriate confidence intervals and pooled estimates in your analytical models, then you should probably use multivariate imputation by chained equations (MICE). For this, you can use the mice package. There is recent functionality within this package that lets you specify which variables to impute with a random forest algorithm and which to impute with the usual methods (e.g. pmm). When specifying your imputation model with the mice() function, you would pass something like meth <- c("rfcat", "rfcont") as the method argument.
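
A minimal sketch of multiple imputation with pooled estimates, using the nhanes example data shipped with mice. Note the assumptions: this uses the method name "rf", which is mice's own random forest method and may require a random forest backend package to be installed, rather than the "rfcat"/"rfcont" names mentioned above, which as far as I know are provided by the separate CALIBERrfimpute package. The regression formula is arbitrary and only there to show pooling.

```r
library(mice)

data(nhanes)

# Five imputed datasets, every incomplete variable imputed by random forest
imp <- mice(nhanes, m = 5, method = "rf", seed = 123, printFlag = FALSE)

# Fit the analysis model on each imputed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Pool estimates and standard errors across imputations (Rubin's rules)
pooled <- pool(fit)
summary(pooled)
```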

missForest has a nice vignette you can look up in R.

Here is a nice resource for how to set up your imputation models using mice:

http://www.stefvanbuuren.nl/publications/MICE%20in%20R%20-%20Draft.pdf
