Solved – High AUC but low R squared in a random forest classifier

aucginimachine learningr-squaredrandom forest

I have been looking for an answer on this website and on Google but I can't seem to find a clear explanation anywhere.

The problem is the following. I built a Random Forest model (using Python's sklearn module) for a binary classification task.

Training, test, everything seems to go well, and I get a relatively high ROC-AUC compared to the previous iteration of my model (I actually calculate the Gini coefficient but use the formule described in this question to convert it to AUC), at around 0.67.

After putting the model in production for a while and that it indeed seems to perform better than the previous one (still with that AUC metric).

At some point I am asked what the R-squared of my model was, that is, as I understand it, the proportion of variance explained. I am a bit puzzled as I usually heard about this metric in a regression task.

"No problem" I think, "I can just estimate the R-squared of my model by checking the labels as 0 and 1 and use the predicted probability as the predicted value". It seems to work relatively well on the training set (R-squared of 0.8), as soon as I try it on a test set it gets really low, even quite often negative !

As far as I understand it, AUC (or Gini) tells me how well I can "separate" my data between each class, and is thus the most important in a classification task, however I am worried a low R-squared tells something about my model. Does it overfit ? Should I instead use the label predicted ?

Best Answer

As you said, it R-squared is not a measure commonly used for classification. What you are using is called Efron's pseudo R-squared. There is not much literature around it. If you want the paper where is proposed, it is here.

I think I know what is happening: you probably have an unbalanced dataset, and, intuitively, this skews the value of the pseudo R-squared. This is just my intuition, though.

In any case, you are right, you should rely on classification specific metrics, such as the AUC. AUC, in particular, tells you how good is the ranking of your algorithm: if you randomly select a positive and a negative instance, it tells you the probability that your model will rank the positive instance higher than the negative one. Nevertheless, it has its problems (see D. Hand, 2009, Measuring classifier performance: a coherent alternative to the area under the ROC curve). If you want you can use other classification metrics to double check your results (e.g. precision, recall, accuracy, f1, etc) but I wouldn't rely on the R-squared too much.

Related Solutions

Solved – TOO low estimated SVM probability for most of the negative test examples

The first thing you want to do is look at the outputs of your trained csvm (not the posterior probabilities!). What is happening is that the fitSVMPosterior tries to fit a sigmoid through the scores / outputs to generate the posterior probabilities. Plot the scores versus the class label. If the outputs do not seem to follow a sigmoid kind of curve, then you know you are in trouble since fitSVMPosterior will not be able to fit it. The best way to evaluate this is to train your classifier for example on a trainset, and plot the predicted scores on a testset versus their class labels.

Furthermore, you mention you use oversampling to adres the imbalance issue. You can also try use weights to train your SVM instead. Apparently Matlab's behavior is to set the weights in such a way they sum up to the prior probabilities (see here). So definitely try to use just a regular sample of your data as well to evaluate your posteriors.

What could be happening is that the svm model (svmc) that you train, is evaluated using the accuracy during cross validation. Furthermore, the svmc model uses the hingeloss. These performance measures do not say anything about the quality of the posteriors. So the problem is that the model tries to minimize the accuracy, and because of this the posterior quality might not be good. I'm going a bit on a limb here, but I'm going to assume you want a model that outputs proper posterior probabilities. So in my answer I'll detail how to do just that.

There are four options: (1) if you really prefer the SVM model, you could change the cross validation procedure to measure the quality of the posteriors, and you can perform cross validation using this performance measure to obtain better posteriors. (2) If you looked at the scores versus the outputs of the model of the SVM and did saw a pattern that could be classified somehow, but not using a sigmoid, it is possible to write your own function to fit the posteriors using a different model, but this could be a lot of work. (3) You could use a kernelized penalized logistic regression model, which directly optimizes the quality of the posteriors during the training procedure (my recommended solution). (4) You could use a Gaussian processes classification model, but they are quite hard to train in practice.

(1) To do this you would: train your model with some hyperparameters (cost, sigma of the kernel if you use a Gaussian kernel) on the training fold, fit the SVM posterior model on the training fold, and predict the posteriors on the test fold. Then compute the log likelihood on the testset using the posteriors and the class labels. Repeat this for all folds, and choose the hyperparameters that give the best log likelihood. Why? The log likelyhood measures the quality of the posteriors, so if this is optimized by cross validation, possibly you will get better posteriors. However, it might be the case that this will not work very well, since the csvm itself does not aim to give accurate posteriors.

2) Take a look at the Platt scaling that is used by fitSVMPosterior. What you can do instead of a logistic transformation, is to use binning. You can bin the scores, and compute the posterior for each bin. You can find some details here. Possibly this will give you better results, but it will likely be a pain to implement and is not used that often...

3) Penalized kernel logistic regression is similar to SVM's. It uses regularization (which corresponds to the cost parameter of the SVM) and can be used with kernels like the SVM model. Mark Schmidt has a nice Matlab implementation here. Take a look at the file minFunc_examples.m and then look for Kernel logistic regression. This model performs quite well for classification in terms of accuracy and can be used to get proper probability estimates.

4) Gaussian processes naturally compute posterior probabilities. If you want to know more I definitely recommend reading this free book. The website also contains code samples which you can use (however, it will take quite some reading to understand what you want to use when).

Finally, it is possible that all these models estimate small posterior probabilities. Maybe this is just optimal? Therefore, if you want the best posteriors, be sure to compare them using some performance measure. As I said before you can use the log likelihood to evaluate the quality of your posteriors.

But perhaps you have some costs for false positives and false negatives? Try to use the performance measure that you are interested in to evaluate your model. If you are going to compare your models, be sure to use proper cross validation, otherwise you will not be able to tell which model is better ;).

Binary Classification – Optimizing AUC vs Logloss

As you mention, AUC is a rank statistic (i.e. scale invariant) & log loss is a calibration statistic. One may trivially construct a model which has the same AUC but fails to minimize log loss w.r.t. some other model by scaling the predicted values. Consider:

auc <-  function(prediction, actual) {
  mann_whit <- wilcox.test(prediction~actual)$statistic
  1 - mann_whit / (sum(actual)*as.double(sum(!actual)))
}

log_loss <- function (prediction, actual) {
  -1/length(prediction) * sum(actual * log(prediction) + (1-actual) * log(1-prediction))
}

sampled_data <- function(effect_size, positive_prior = .03, n_obs = 5e3) {
  y <- rbinom(n_obs, size = 1, prob = positive_prior)
  data.frame( y = y,
              x1 =rnorm(n_obs, mean = ifelse(y==1, effect_size, 0)))
}

train_data <- sampled_data(4)
m1 <- glm(y~x1, data = train_data, family = 'binomial')
m2 <- m1
m2$coefficients[2] <- 2 * m2$coefficients[2]

m1_predictions <- predict(m1, newdata = train_data, type= 'response')
m2_predictions <- predict(m2, newdata = train_data, type= 'response')

auc(m1_predictions, train_data$y)
#0.9925867 
auc(m2_predictions, train_data$y)
#0.9925867 

log_loss(m1_predictions, train_data$y)
#0.01985058
log_loss(m2_predictions, train_data$y)
#0.2355433

So, we cannot say that a model maximizing AUC means minimized log loss. Whether a model minimizing log loss corresponds to maximized AUC will rely heavily on the context; class separability, model bias, etc. In practice, one might consider a weak relationship, but in general they are simply different objectives. Consider the following example which grows the class separability (effect size of our predictor):

for (effect_size in 1:7) {
  results <- dplyr::bind_rows(lapply(1:100, function(trial) {
                                    train_data <- sampled_data(effect_size)
                                    m <- glm(y~x1, data = train_data, family = 'binomial')
                                    predictions <- predict(m, type = 'response')
                                    list(auc = auc(predictions, train_data$y),
                                         log_loss = log_loss(predictions, train_data$y),
                                         effect_size = effect_size)
                                  }))
  plot(results$auc, results$log_loss, main = paste("Effect size =", effect_size))
  readline()
}

Best Answer

Related Solutions

Solved – TOO low estimated SVM probability for most of the negative test examples

Binary Classification – Optimizing AUC vs Logloss

Related Question