Solved – Calculating AUC on test set of random forest model in R

auc, cross-validation, r, random forest

I have built a random forest model within a k-fold cross-validation and have had no problem calculating the AUC on the training set. However, after trying a few methods of calculating AUC on the test data, I keep getting errors to the effect of "response and predictors must be of the same length".

I suspect this is because I built the model on the larger training data, but I'm not sure.

Below is an example of the code I've been using.

library(randomForest)
library(pROC)

# setting up k-fold

k <- 5
set.seed(123)
folds <- rep_len(1:k,nrow(df))
folds <- sample(folds,nrow(df))

# model
for (i in 1:k){
  fold <- which(folds == i)
  rf.model <- randomForest(df$outcome ~ ., ntree = 200, data = df, subset = -fold)
}

# subset out training and test set 
train <- df[-fold,]
test <- df[fold,]

# calculate ROC/AUC on training data 
roc.train <- roc(train$outcome, rf.model$votes[,2])
auc(roc.train)

# calculate ROC/AUC on test data 
roc.test <- roc(test$outcome, rf.model$votes[,2])
auc(roc.test)

As I said, it works fine on the training data, but not on the test data. What is the appropriate way to calculate this? Any advice is appreciated!

Best Answer

Your cross-validation should be comparing the outcome in the test data to the predictions your model makes given the predictive features in the test data.

So you need to generate the predictions first!

At the moment you are comparing the model's predictions on the training data (rf.model$votes) to the outcomes in the test data, which is why the two vectors are not the same length.

I don't know the randomForest package well, but for a classification forest predict() returns class labels by default, so you'll want to ask for class probabilities with type = "prob" and run something like:

test.predictions <- predict(rf.model, newdata = test, type = "prob")
roc.test <- roc(test$outcome, test.predictions[, 2])
auc(roc.test)
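
The [, 2] column of that probability matrix corresponds to the second level of the outcome factor; check levels(test$outcome) to make sure that is the class you want to treat as the event.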

Another issue you have is that although you are estimating lots of random forests on different training datasets in your for loop, you are only doing the testing part on the last iteration, because that part is outside the loop. So you'll need to move the train/test split, the model fit, and the AUC calculation inside the loop; a rough sketch is below.
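
I don't know your exact setup, but as a rough sketch (assuming df has a two-level factor column named outcome and you are using the randomForest and pROC packages), the pieces could fit together inside the loop something like this, giving you one test-set AUC per fold:

library(randomForest)
library(pROC)

# fold setup as in your question
k <- 5
set.seed(123)
folds <- sample(rep_len(1:k, nrow(df)))

test.auc <- numeric(k)

for (i in 1:k){
  fold  <- which(folds == i)
  train <- df[-fold, ]
  test  <- df[fold, ]

  # fit on the training folds only
  # (assumes df has a two-level factor column named outcome)
  rf.model <- randomForest(outcome ~ ., data = train, ntree = 200)

  # predicted probability of the second outcome level for the held-out fold
  test.prob <- predict(rf.model, newdata = test, type = "prob")[, 2]

  # test-set AUC for this fold
  test.auc[i] <- auc(roc(test$outcome, test.prob))
}

test.auc        # AUC per fold
mean(test.auc)  # cross-validated AUC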