Solved – Does oversampling/undersampling change the distribution of the data

classification, oversampling, random-forest, resampling, unbalanced-classes

I have an imbalanced dataset (10000 positives and 300 negatives) and have divided this into train and test sets. I perform oversampling/undersampling only on the train set since doing this on the test set would not represent a real-world scenario.
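For concreteness, a minimal sketch of that setup with scikit-learn (X and y are placeholders, and random oversampling via sklearn.utils.resample stands in for whatever resampling method is actually used):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Placeholder data: 10000 positives (1) and 300 negatives (0)
X = np.random.rand(10300, 5)
y = np.array([1] * 10000 + [0] * 300)

# Stratified split keeps the original class ratio in both folds
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Resample the training fold only; the test fold keeps the real-world ratio.
# Here: random oversampling of the minority (negative) class with replacement.
minority = y_train == 0
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=int((~minority).sum()), random_state=42,
)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])
```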

A Random Forest classifier classifies the training set well (F-score of 0.92 for both the positive and negative classes) but performs badly on the test set (F-score of 0.83 for the positive class and 0.13 for the negative class).
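(Per-class F-scores like these can be read off scikit-learn's classification_report; continuing the sketch above, purely for illustration:)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_bal, y_train_bal)

# Per-class precision/recall/F1 on the balanced training fold vs. the untouched test fold
print(classification_report(y_train_bal, clf.predict(X_train_bal)))
print(classification_report(y_test, clf.predict(X_test)))
```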

Why does the classifier perform poorly on the test set even though it has learnt to separate the two classes in the train set? Could it be because the class distribution of the train set is now different from that of the test set? If so, how do I take care of this?

I came across this post but the answers are not particularly helpful.

Best Answer

The answer to the title question is "of course it does"; you are shifting the training set's class distribution toward the minority class.

You can shift your model's predictions back to match the original distribution; see e.g. Convert predicted probabilities after downsampling to actual probabilities in classification. Equivalently, you can adjust the prediction threshold instead.
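As a concrete illustration of the shifting-back idea, here is a sketch of the standard prior-shift (odds) correction; this is not code from the linked post, and the function name and numbers are illustrative:

```python
def correct_probability(p_resampled, prior_resampled, prior_original):
    """Rescale a predicted probability from the resampled class prior back to the
    original class prior by multiplying the odds (illustrative helper)."""
    # Factor by which resampling inflated the odds of the class of interest
    odds_factor = (prior_original / (1 - prior_original)) / (
        prior_resampled / (1 - prior_resampled)
    )
    corrected_odds = odds_factor * p_resampled / (1 - p_resampled)
    return corrected_odds / (1 + corrected_odds)

# Example with the numbers from the question: the negative class is 300/10300 of the
# original data but (say) 50% of the resampled training set
# p_corrected = correct_probability(clf.predict_proba(X_test)[:, 0], 0.5, 300 / 10300)
```

Classifying the corrected probabilities at 0.5 is the same as classifying the raw model outputs at a shifted threshold, which is the "adjust the prediction threshold" option mentioned above.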

There's also a serious question of whether you needed to resample in the first place; see What is the root cause of the class imbalance problem? and When is unbalanced data really a problem in Machine Learning? If you do get better performance after balancing, with correct use of prediction thresholds/shifting, I'd like to know about it: I haven't been able to find a definitive answer on whether balancing helps a classifier learn. (Henry's answer to the second linked question here suggests not, but...)