Solved – Is up- or down-sampling imbalanced data actually that effective, and why?

classification, roc, unbalanced-classes

I frequently hear up- or down-sampling discussed as a way of dealing with classification on imbalanced data.

I understand that this could be useful if you're working with a binary (as opposed to probabilistic or score-based) classifier and treating it as a black box: sampling schemes are then your only way to tweak its position on the "ROC curve" (in quotes because an inherently binary classifier doesn't have a real ROC curve, but the same trade-off between false positives and false negatives still applies).

But the same justification doesn't seem to hold if you actually have access to some kind of score that you later threshold to make a decision. In that case, isn't up-sampling just an ad hoc way of expressing a view about your desired trade-off between false positives and false negatives, when you have much better tools available, like actual ROC analysis? It seems like it would be weird in this case to expect up-sampling or down-sampling to do anything but change your classifier's "prior" on each class (i.e., the unconditional probability of being that class, the baseline prediction); I wouldn't expect it to change the classifier's "odds ratio" (how much the classifier adjusts its baseline prediction based on the covariates).
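A quick way to check that intuition for logistic regression: duplicating the minority class k times is equivalent to weighting those cases by k, which (asymptotically, for a well-specified model) shifts the fitted intercept by about log(k) while leaving the slope coefficients, i.e., the log odds ratios, essentially unchanged. Here's a minimal sketch with scikit-learn on synthetic Gaussian data; everything in it (means, sample sizes, k) is made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pos, n_neg = 500, 5000                    # ~9% positives
X_pos = rng.normal(loc=1.0, size=(n_pos, 2))
X_neg = rng.normal(loc=0.0, size=(n_neg, 2))
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(n_pos), np.zeros(n_neg)]

k = 10                                      # upsample positives 10x
X_up = np.vstack([np.repeat(X_pos, k, axis=0), X_neg])
y_up = np.r_[np.ones(n_pos * k), np.zeros(n_neg)]

# large C ~ unregularized MLE, so the intercept shift comes out cleanly
base = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
up = LogisticRegression(C=1e6, max_iter=1000).fit(X_up, y_up)

print("slopes, original :", base.coef_)     # nearly identical slopes...
print("slopes, upsampled:", up.coef_)
print("intercept shift  :", (up.intercept_ - base.intercept_)[0],
      "vs log(k) =", np.log(k))             # ...only the "prior" moves
```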

So my question is: if you have a classifier that isn't a binary black box, are there any reasons to expect up- or down-sampling to have a much better effect than adjusting the threshold to your liking? Failing that, are there any empirical studies showing reasonably large effects for up- or down-sampling on reasonable performance metrics (e.g., not accuracy)?

Best Answer

The short answer appears to be Yes: there is some evidence that upsampling of the minority class and/or downsampling of the majority class in a training set can somewhat improve out-of-sample AUC (area under the ROC curve, a threshold-independent metric) even on the unaltered, unbalanced data distribution.

With that said, in most or all of the examples I've seen, the increases in AUC are very modest. A typical "best case" (i.e., the best over all the models and sampling methods examined by an author) would be, say, AUC = .91 without up/downsampling vs. AUC = .93 with up/downsampling. I haven't seen any example where up/downsampling turns a bad AUC into a good one. I'm also not aware of evidence that upsampling/downsampling can improve generalization under a strictly proper scoring rule like the Brier score (see this great answer for more info about that).
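If you want to run this comparison on your own data, the evaluation protocol is the part that matters: resample only the training set, then score both fits on the same untouched, imbalanced test set, using a threshold-free metric (AUC) alongside a strictly proper one (Brier score). Here is a minimal sketch, assuming scikit-learn; the synthetic dataset, the random forest, and the 50:50 downsampling target are all my own choices, just to fix ideas:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

# synthetic imbalanced problem: ~5% positives
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# simple random downsampling of the majority class to a 50:50 balance
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
keep = np.r_[pos, neg]

for name, (Xf, yf) in {"baseline   ": (X_tr, y_tr),
                       "downsampled": (X_tr[keep], y_tr[keep])}.items():
    p = RandomForestClassifier(random_state=0).fit(Xf, yf).predict_proba(X_te)[:, 1]
    print(name, " AUC =", round(roc_auc_score(y_te, p), 3),
          " Brier =", round(brier_score_loss(y_te, p), 3))
```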

Some evidence

  • The most direct evidence I've seen, and the only one that includes some theoretical analysis, is from the paper Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis?. The authors show that when using LDA on two Gaussian classes with unequal covariance matrices (contrary to an assumption of LDA), both simple upsampling and simple downsampling (nothing fancier like SMOTE) to achieve 50:50 class balance can improve generalization for the unbalanced data distribution; a simulation sketch along these lines appears after this list. Here's the paper's key figure:

[Figure omitted.]

  • In the paper Handling class imbalance in customer churn prediction, the authors examine both simple downsampling ("under-sampling") and an "advanced under-sampling method" called CUBE, for logistic regression and random forest. They conclude that downsampling helps, but CUBE does not seem to improve over simple downsampling to any meaningful extent. In the paper's key figure, the leftmost point on each curve is for the unaltered dataset with no downsampling. [Figure omitted.]

  • In this example from the docs for the imbalanced-learn Python package, the authors examine the AUC performance of a K-nearest-neighbors classifier under three sophisticated up/downsampling methods as well as a baseline with no up/downsampling. The key figure shows the ROC curves for each method. [Figure omitted.]

  • I found this R notebook that looks at logistic regression, comparing cross-validated AUC for baseline (no up/downsampling) vs. simple downsampling vs. a more sophisticated upsampling method called ROSE. The author concludes that simple downsampling doesn't help much, but that ROSE leads to a somewhat better ROC curve. In the key figure,

    • green curve = ROSE (AUC = .639)
    • black curve = baseline (AUC = .587)
    • red curve = simple downsampling (AUC = .575)

[Figure omitted.]
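To make the first bullet concrete, here's a simulation sketch of the same kind of setting (my own code, not the paper's): two Gaussian classes with unequal covariance matrices and a 1:9 imbalance, with LDA fit on the raw training set vs. a 50:50 downsampled one, and both scored by AUC on a test set that keeps the original imbalance. Per the paper's analysis, the rebalanced fit can come out slightly ahead:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def draw(n_pos, n_neg):
    # unequal covariances violate LDA's pooled-covariance assumption
    X_pos = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], n_pos)
    X_neg = rng.multivariate_normal([0.0, 0.0], [[4.0, 0.0], [0.0, 0.25]], n_neg)
    return np.vstack([X_pos, X_neg]), np.r_[np.ones(n_pos), np.zeros(n_neg)]

X_tr, y_tr = draw(200, 1800)    # imbalanced training set (1:9)
X_te, y_te = draw(2000, 18000)  # test set keeps the original imbalance

# simple random downsampling of the majority class to 50:50
neg = rng.choice(np.flatnonzero(y_tr == 0), size=int(y_tr.sum()), replace=False)
keep = np.r_[np.flatnonzero(y_tr == 1), neg]

for name, (Xf, yf) in {"no rebalancing   ": (X_tr, y_tr),
                       "downsampled 50:50": (X_tr[keep], y_tr[keep])}.items():
    score = LinearDiscriminantAnalysis().fit(Xf, yf).decision_function(X_te)
    print(name, "test AUC =", round(roc_auc_score(y_te, score), 3))
```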