Solved – Imbalanced Test Data

machine-learning, metric, unbalanced-classes

I have an imbalanced (1:5) training and test set with only two classes, and have oversampled the training set with SMOTE so that the class ratio is 1:1. The ML model gives values over 0.7 for accuracy, precision, recall, and F1 on the training set. However, since the test set is still imbalanced (1:5), the metrics there are still above 0.7, but only because the model performs well on the majority class while failing miserably on the minority class (even though it did okay on the training data). Perhaps it is overfitting and not generalizing well to the test set. Currently it correctly classifies only around 6% of the minority class in the test data.

Does anybody have suggestions for building a more robust ML model for binary document classification? Additionally, are there better metrics to use when the test set is imbalanced (e.g., FPR and TPR)?

Best Answer

(This started as a comment)

Regarding some good threads already available: I would strongly suggest looking into the following threads:

  1. Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
  2. When is unbalanced data really a problem in Machine Learning?
  3. What problem does oversampling, undersampling, and SMOTE solve?

They give a very good idea about the subtleties of the imbalanced-learning problem. They should help build a better appreciation of the issue, because reading bite-sized cook-book suggestions (like the one I give below) is only a stop-gap measure.

Regarding the calibration of prediction:

If the observed class proportions before re-sampling are, say, 0.5-to-99.5 and we do a 1% negative downsampling, the observed class proportions in our new sample will now reflect approximately a 34-to-66 proportion. This is our "downsampled space", where we train the learner. For actual deployment we need to re-calibrate the learner so that its predictions reflect the original 0.5% base rate; probabilities estimated under a 34-to-66 proportion would be unreasonably high in our original space. A straightforward way is to calculate the new probabilities as $q = \frac{p}{p + \frac{1-p}{w}}$, where $p$ is the prediction in the downsampled space and $w$ is the negative downsampling rate. So, for example, if we predicted $p = 0.5$ in the example above, the actual probability should be more like $q = \frac{0.5}{0.5 + 0.5/0.01} \approx 0.0099$.
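
As a minimal sketch (my own illustration, not part of the original answer; the helper name `recalibrate` and the use of NumPy are assumptions), the re-calibration is a one-liner:

```python
import numpy as np

def recalibrate(p_downsampled, w):
    """Map probabilities predicted in the downsampled space back to the
    original space; w is the fraction of negatives kept (e.g. 0.01).
    Implements q = p / (p + (1 - p) / w)."""
    p = np.asarray(p_downsampled, dtype=float)
    return p / (p + (1.0 - p) / w)

# A prediction of 0.5 in the downsampled space with a 1% negative
# downsampling rate corresponds to roughly 0.0099 in the original space.
print(recalibrate(0.5, w=0.01))  # ~0.009901
```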

Two good first references on the matter are Dal Pozzolo et al. (2015), Calibrating Probability with Undersampling for Unbalanced Classification, and Elkan (2001), The Foundations of Cost-Sensitive Learning. (The formula above is effectively Eq. 3 from Dal Pozzolo's paper.)

Just to be clear: in any classification problem it is far better to focus on assigning costs to misclassifications than to keep hammering on metrics like AUC-ROC, AUC-PR, Cohen's $\kappa$, and the like. As a real-life example: a screening tool and a diagnostic tool serve different purposes, so evaluating their utility with the same metric is probably an oversimplification.
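
To make the cost-based view concrete, here is a toy sketch of choosing a decision threshold by minimizing expected misclassification cost (the cost values, function names, and the threshold grid are illustrative assumptions, not part of the original answer):

```python
import numpy as np

# Hypothetical cost matrix: a missed minority case (false negative) is taken
# to be 20x as costly as a false alarm (false positive). These numbers are
# illustrative; in practice they come from the application, not the data.
COST_FP = 1.0
COST_FN = 20.0

def expected_cost(y_true, p_pred, threshold):
    """Total misclassification cost at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    return COST_FP * fp + COST_FN * fn

def best_threshold(y_true, p_pred, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold that minimizes expected misclassification cost."""
    costs = [expected_cost(y_true, p_pred, t) for t in grid]
    return grid[int(np.argmin(costs))]
```

With well-calibrated probabilities, the standard cost-sensitive result (as in Elkan's paper) gives the optimal threshold in closed form as $\frac{c_{FP}}{c_{FP} + c_{FN}}$ (here $1/21 \approx 0.048$), so the grid search above is mostly a sanity check.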