Solved – Optimising for Precision-Recall curves under class imbalance

data-visualization, machine-learning, precision-recall, roc, unbalanced-classes

I have a classification task with a number of predictors (one of which is the most informative), and I am using the MARS model to construct my classifier (any simple model would do; using GLMs for illustrative purposes would be fine too). I have a huge class imbalance in the training data (about 2700 negative samples for each positive sample). As in Information Retrieval tasks, I am mostly concerned about predicting the top-ranking positive test samples, so performance on Precision-Recall curves is what matters to me.
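For reference, this is roughly how I am producing the PR and ROC evaluations. A minimal sketch assuming scikit-learn, with a logistic regression and synthetic 1%-positive data standing in for my MARS model and real data (the actual ratio is ~1:2700):

```python
# Sketch: scoring a classifier and evaluating it with PR and ROC curves.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Synthetic imbalanced data: ~1% positives (placeholder for the real set)
X, y = make_classification(n_samples=20000, n_features=5, n_informative=2,
                           weights=[0.99], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # ranking scores for the positive class

prec, rec, _ = precision_recall_curve(y, scores)
fpr, tpr, _ = roc_curve(y, scores)
print(f"AP  = {average_precision_score(y, scores):.3f}")
print(f"AUC = {roc_auc_score(y, scores):.3f}")
```

The curves `(rec, prec)` and `(fpr, tpr)` are what the plots below show, with the model in red and the single most informative input (used directly as a score) in blue.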

First of all, I simply trained the model on my training data keeping the class imbalance as it is. I visualize my trained model in red, and the most important input in blue.

Training on unbalanced data, evaluation on unbalanced data:

PR for unbalanced training
ROC for unbalanced training

Thinking that the class imbalance is throwing the model off, since learning the top-ranking positive samples is a minuscule part of the whole data set, I up-sampled the positive training points to get a balanced training data set. When I plot the performance on the balanced training set, I get good performance: in both the PR and ROC curves, my trained model does better than the inputs.
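The up-sampling step itself is straightforward. A sketch using scikit-learn's `resample` utility on placeholder arrays `X`, `y` (the real features and labels are assumed):

```python
# Sketch: up-sampling the minority (positive) class to a 1:1 ratio.
import numpy as np
from sklearn.utils import resample

X = np.random.randn(1000, 3)                        # placeholder features
y = np.r_[np.ones(10), np.zeros(990)].astype(int)   # heavy imbalance

X_pos, X_neg = X[y == 1], X[y == 0]
# Draw positives with replacement until they match the negatives in number
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.r_[np.zeros(len(X_neg)), np.ones(len(X_pos_up))].astype(int)
print(np.bincount(y_bal))  # both classes now have equal counts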

Training on (upsampled) balanced data, evaluation also on (upsampled) balanced data:

PR for balanced training, visualised on balanced dataset
ROC for balanced training, visualised on balanced dataset

However, if I use this model trained on the balanced data to predict on the original, unbalanced training set, I still get poor performance on the PR curve.

Training on (upsampled) balanced data, evaluation on original unbalanced data:

PR for balanced training, visualised on original, unbalanced dataset
ROC for balanced training, visualised on original, unbalanced dataset

So my questions are:

  1. Is class imbalance the reason the PR curve shows inferior performance of my trained model (red) while the ROC curve shows an improvement?
  2. Can resampling/up-sampling/down-sampling approaches resolve this to force the training to focus on the high precision/low recall region?
  3. Is there any other way to focus training on the high precision/low recall region?

Best Answer

  1. The ROC curve is insensitive to changes in class imbalance; see Fawcett (2004) "ROC Graphs: Notes and Practical Considerations for Researchers".
  2. Up-sampling the low-frequency class is a reasonable approach.
  3. There are many other ways of dealing with class imbalance. Boosting and bagging are two techniques that come to mind. This seems like a relevant recent study: Comparing Boosting and Bagging Techniques With Noisy and Imbalanced Data
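On point 2/3: many learners also accept per-class weights, which achieves the same effect as up-sampling without duplicating rows. A sketch assuming scikit-learn, with `LogisticRegression` again standing in for the GLM/MARS model:

```python
# Sketch: class weighting as an alternative to up-sampling.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=20000, n_features=5, n_informative=2,
                           weights=[0.99], random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# equivalent in expectation to up-sampling the minority class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]
print(f"AP = {average_precision_score(y, scores):.3f}")
```

Note that either reweighting or up-sampling shifts the fitted scores; always evaluate the PR curve on data at the original prevalence, since that is the setting you actually care about.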

P.S. Neat problem; I'd love to know how it turns out.
