Solved – R AUC never less than 0.5

Tags: auc, r, random forest

I'm doing some work with random forests in R using the randomForest package, and I've run into something that seems odd to me. Even when the data is completely random, the AUC is never less than 0.5. For example, when I run the following:

library(randomForest)
df.sanity <- data.frame(A=sample(1:100, 2000, replace=TRUE),
                        B=sample(126:159, 2000, replace=TRUE),
                        C=sample(10:2000, 2000, replace=TRUE),
                        D=sample(1:2, 2000, replace=TRUE),
                        E=sample(30:40, 2000, replace=TRUE),
                        Class=as.factor(sample(0:1, 2000, replace=TRUE)))
rf <- randomForest(x=df.sanity[1:1000, c("A", "B", "C", "D", "E")], y=df.sanity[1:1000, "Class"])
preds <- predict(rf, newdata=df.sanity[1001:2000,], type="prob")
auc(obs=df.sanity[1001:2000, "Class"], pred=preds[,2])

No matter how many times I run it, the AUC is never less than 0.5. It's often a bit over (up to 0.54 from what I've seen), but never less.

The only other AUC implementation I've used is Weka's, and I've seen AUCs < 0.5 there. Does the randomForest package automatically flip the predictions to the reverse if the AUC is ever less than 0.5, or is there something else I'm misunderstanding here?

Best Answer

There is no auc() function in the randomForest package. But based on the argument names you used (obs and pred), I think you might have used the auc() function in the SDMTools package. And yes, this function does flip the results if the calculated AUC is less than 0.5:

> SDMTools::auc
function (obs, pred) 
{
    … code to calculate the AUC …
    if (AUC < 0.5) 
        AUC = 1 - AUC
    return(AUC)
}

This might be seen as a nice convenience feature (it’s easy to forget exactly how AUC functions want their arguments coded, and if you get an AUC < 0.5 in real life, you have usually just coded the response vector the wrong way round), but I think it’s a bad idea. If you try models that are (almost) as bad as random, e.g., in simulations, you will get estimates that are biased upward (compared to the ‘correct’ AUC estimator).
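To make the bias concrete, here is a small base-R sketch. It does not use the SDMTools code; auc_raw() and auc_folded() are hypothetical helper names, and the AUC is computed with the standard rank-based (Mann-Whitney) formula as a stand-in for the elided calculation. The scores are pure noise, so an unbiased estimator should average about 0.5.

```r
# Rank-based (Mann-Whitney) AUC estimate; no folding is applied.
# obs: 0/1 vector of true classes, pred: numeric scores (higher = more likely 1).
auc_raw <- function(obs, pred) {
  r  <- rank(pred)                 # midranks, so ties are handled
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Folded version, mimicking the `if (AUC < 0.5) AUC = 1 - AUC` step.
auc_folded <- function(obs, pred) {
  a <- auc_raw(obs, pred)
  max(a, 1 - a)
}

set.seed(1)
sims <- replicate(2000, {
  obs  <- sample(0:1, 1000, replace = TRUE)
  pred <- runif(1000)              # scores unrelated to the class
  c(raw = auc_raw(obs, pred), folded = auc_folded(obs, pred))
})

rowMeans(sims)                     # compare the mean raw and mean folded AUC
mean(sims["raw", ] < 0.5)          # share of raw AUCs that fall below 0.5
min(sims["folded", ])              # the folded AUC cannot drop below 0.5
```

The raw estimates scatter on both sides of 0.5, while the folded ones are reflected onto the upper side, which is why the question never sees a value below 0.5.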

If you want the correct AUC estimates, you can use any of the following:

  • ROC() from the Epi package. It draws a nice ROC plot, with the AUC embedded, and also returns an object with the AUC stored as the AUC element.
  • rcorr.cens() from the Hmisc package (a dependency of rms). It returns a named vector with the AUC (the C index) as the first element.
  • roc() from the pROC package if you manually specify the direction argument. It returns an object with the AUC stored as the auc element.
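
As a rough illustration of the last option, the sketch below builds scores that are deliberately reversed (class 1 tends to get lower scores) and compares a fixed direction with pROC's default. It assumes pROC is installed and skips the comparison otherwise; the simulated data and the variable names are made up for this example.

```r
set.seed(42)
obs  <- sample(0:1, 500, replace = TRUE)
# Deliberately reversed scores: class 1 tends to score LOWER than class 0.
pred <- ifelse(obs == 1, rnorm(500, mean = -1), rnorm(500, mean = 0))

if (requireNamespace("pROC", quietly = TRUE)) {
  # direction = "<": controls (level 0) are expected to score lower than cases (level 1).
  fixed <- pROC::roc(response = obs, predictor = pred,
                     levels = c(0, 1), direction = "<")
  # Default direction = "auto": pROC picks the direction itself.
  auto  <- pROC::roc(response = obs, predictor = pred, levels = c(0, 1))

  print(as.numeric(fixed$auc))     # fixed direction: can report an AUC below 0.5
  print(as.numeric(auto$auc))      # automatic direction: compare with the value above
}
```

With the direction fixed, a model that ranks the classes the wrong way round shows up as an AUC below 0.5 instead of being silently flipped.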