What problem do oversampling, undersampling, and SMOTE solve?

classification, machine-learning, predictive-models, unbalanced-classes

In a recent, well-received question, Tim asks: when is unbalanced data really a problem in machine learning? The premise of the question is that there is a lot of machine learning literature discussing class balance and the problem of imbalanced classes. The idea is that datasets with an imbalance between the positive and negative classes cause problems for some machine learning classification algorithms (I'm including probabilistic models here), and that methods should be sought to "balance" the dataset, restoring a perfect 50/50 split between positive and negative classes.

The general sense of the upvoted answers is that "it's not, at least if you are thoughtful in your modeling". M. Henry L., in an upvoted comment on the accepted answer, states:

[…] there isn't a low level problem with using unbalanced data. In my experience, the advice to "avoid unbalanced data" is either algorithm-specific, or inherited wisdom. I agree with AdamO that in general, unbalanced data poses no conceptual problem to a well-specified model.

AdamO argues that the "problem" with class balance is really one of class rarity:

Therefore, at least in regression (but I suspect in all circumstances), the only problem with imbalanced data is that you effectively have small sample size. If any method is suitable for the number of people in the rarer class, there should be no issue if their proportion membership is imbalanced.

If this is the true issue at hand, it leaves an open question: what is the purpose of all the resampling methods intended to balance the dataset (oversampling, undersampling, SMOTE, etc.)? Clearly they don't address the problem of implicitly having a small sample size; you can't create information out of nothing!
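For concreteness, here is a minimal sketch of what these resampling methods actually do to a dataset. It assumes scikit-learn and the imbalanced-learn package are installed; the synthetic dataset and the 90/10 imbalance are illustrative choices, not from the original question. The point it makes is the one above: each method only changes class counts by duplicating, interpolating between, or discarding existing rows.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# An imbalanced two-class problem: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Random oversampling duplicates minority rows, SMOTE interpolates between
# minority-class neighbours, and random undersampling discards majority rows.
samplers = [
    ("oversample", RandomOverSampler(random_state=0)),
    ("SMOTE", SMOTE(random_state=0)),
    ("undersample", RandomUnderSampler(random_state=0)),
]
for name, sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```

All three end up with balanced class counts, but none of them adds information that wasn't already in the minority-class rows.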

Best Answer

The problem these methods are trying to solve is to increase the impact of the minority class on the cost function. This is because algorithms try to fit the whole dataset well, and so adapt to the majority class. Another approach would be to use class weights, and in most cases this approach gives better results, since there is no information loss (as with undersampling) and no performance loss or introduction of noise (as with oversampling).
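Here is a minimal sketch of the class-weight alternative, assuming scikit-learn; the synthetic dataset and the class_weight="balanced" setting are illustrative assumptions, not something specified in the answer. Instead of resampling the data, each minority example's contribution to the log-loss is scaled up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Same kind of 90/10 imbalanced problem as above (class 1 is the rare class).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" sets each class's weight to
# n_samples / (n_classes * n_class_samples), so with a 9:1 imbalance each
# minority example counts roughly 9x as much in the fitted loss.
unweighted = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

print("minority recall, unweighted:", recall_score(y_te, unweighted.predict(X_te)))
print("minority recall, weighted:  ", recall_score(y_te, weighted.predict(X_te)))
```

The reweighted model typically trades some accuracy on the majority class for higher recall on the rare class, which is exactly the shift in the cost function that resampling tries to achieve, but without discarding or duplicating any data.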