Solved – High AUC but low R-squared in a random forest classifier

auc, gini, machine learning, r-squared, random forest

I have been looking for an answer on this website and on Google, but I can't seem to find a clear explanation anywhere.

The problem is the following. I built a Random Forest model (using Python's sklearn module) for a binary classification task.

Training and testing both seem to go well, and I get a relatively high ROC-AUC of around 0.67 compared to the previous iteration of my model (I actually calculate the Gini coefficient, but use the formula described in this question to convert it to AUC).
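For reference, the conversion is the standard linear relation between the Gini coefficient and the AUC (I assume this is the formula that question describes):

$$\text{AUC} = \frac{\text{Gini} + 1}{2}, \qquad \text{equivalently} \qquad \text{Gini} = 2\,\text{AUC} - 1.$$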

After putting the model in production for a while, it does indeed seem to perform better than the previous one (still according to that same AUC metric).

At some point I am asked what the R-squared of my model is, that is, as I understand it, the proportion of variance explained. I am a bit puzzled, as I have usually heard about this metric in regression tasks.

"No problem" I think, "I can just estimate the R-squared of my model by checking the labels as 0 and 1 and use the predicted probability as the predicted value". It seems to work relatively well on the training set (R-squared of 0.8), as soon as I try it on a test set it gets really low, even quite often negative !

As far as I understand it, AUC (or Gini) tells me how well I can "separate" my data between the two classes, and is thus the most important metric in a classification task. However, I am worried that a low R-squared says something about my model. Does it overfit? Should I instead use the predicted label?

Best Answer

As you said, R-squared is not a measure commonly used for classification. What you are computing is called Efron's pseudo R-squared. There is not much literature around it; if you want the paper where it is proposed, it is here.
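For completeness, Efron's pseudo R-squared is exactly the quantity you computed: with labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{\pi}_i$,

$$R^2_{\text{Efron}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{\pi}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$

This is the ordinary R-squared formula applied to 0/1 labels and probabilities, so it goes negative whenever the model's squared errors exceed those of the constant predictor $\bar{y}$.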

I think I know what is happening: you probably have an imbalanced dataset, and, intuitively, this skews the value of the pseudo R-squared. Note that for binary labels the denominator above is $\sum_i (y_i - \bar{y})^2 = n\bar{y}(1 - \bar{y})$, which shrinks as the classes get more imbalanced, so even modest errors in the predicted probabilities can drive the ratio above 1 and the pseudo R-squared below zero. This is just my intuition, though.

In any case, you are right: you should rely on classification-specific metrics, such as the AUC. The AUC, in particular, tells you how good your model's ranking is: if you randomly select a positive and a negative instance, it is the probability that your model will rank the positive instance higher than the negative one. Nevertheless, the AUC has its problems (see D. Hand, 2009, "Measuring classifier performance: a coherent alternative to the area under the ROC curve"). If you want, you can use other classification metrics to double-check your results (e.g. precision, recall, accuracy, F1), but I wouldn't rely on the R-squared too much.
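As a rough sketch of that double-check, assuming the same fitted classifier clf as above and an (arbitrary) 0.5 probability cutoff for turning probabilities into hard labels:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
pred = (proba >= 0.5).astype(int)         # hard labels at an assumed 0.5 cutoff

print("AUC:      ", roc_auc_score(y_test, proba))   # ranking quality, threshold-free
print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
print("Accuracy: ", accuracy_score(y_test, pred))
print("F1:       ", f1_score(y_test, pred))
```

Note that the AUC is computed from the raw probabilities, while the other metrics depend on the chosen threshold; on an imbalanced dataset that cutoff is itself worth scrutinizing.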