When is unbalanced data really a problem in Machine Learning?

classification, faq, machine learning, predictive-models, unbalanced-classes

We have already had multiple questions about unbalanced data when using logistic regression, SVM, decision trees, bagging, and a number of other similar questions, which makes it a very popular topic! Unfortunately, each of the questions seems to be algorithm-specific, and I haven't found any general guidelines for dealing with unbalanced data.

Quoting one of the answers by Marc Claesen, dealing with unbalanced data

"(…) heavily depends on the learning method. Most general purpose approaches have one (or several) ways to deal with this."

But when exactly should we worry about unbalanced data? Which algorithms are most affected by it, and which are able to deal with it? Which algorithms would need us to balance the data? I am aware that discussing each of the algorithms would be impossible on a Q&A site like this. Rather, I am looking for general guidelines on when it could be a problem.

Best Answer

Not a direct answer, but it's worth noting that in the statistical literature, some of the prejudice against unbalanced data has historical roots.

Many classical models simplify neatly under the assumption of balanced data, especially methods like ANOVA that are closely related to experimental design, a traditional, if not the original, motivation for developing statistical methods.

But the statistical / probabilistic arithmetic gets quite ugly, quite quickly, with unbalanced data. Prior to the widespread adoption of computers, the by-hand calculations were so extensive that estimating models on unbalanced data was practically impossible.

Of course, computers have basically rendered this a non-issue. Likewise, we can estimate models on massive datasets, solve high-dimensional optimization problems, and draw samples from analytically intractable joint probability distributions, all of which were functionally impossible some fifty years ago.
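To make the "simplifies neatly" point concrete, here is a minimal sketch, not taken from any particular textbook. It assumes a noise-free two-factor additive model with made-up effects (MU, A_EFFECT, B_EFFECT and the cell counts are all hypothetical). With balanced cells, the hand-friendly shortcut of differencing raw group means gives the effect of factor A directly. With unbalanced cells the shortcut is contaminated by factor B, and you have to solve the full least-squares system, which is tedious by hand and trivial for a computer.

```python
from statistics import mean

# Hypothetical true values for y = MU + A_EFFECT*a + B_EFFECT*b (no noise).
MU, A_EFFECT, B_EFFECT = 1.0, 2.0, 10.0

def make_data(cell_counts):
    """Build (a, b, y) rows; cell_counts maps (a, b) -> number of observations."""
    rows = []
    for (a, b), n in cell_counts.items():
        rows.extend([(a, b, MU + A_EFFECT * a + B_EFFECT * b)] * n)
    return rows

def marginal_difference(rows):
    """Shortcut estimate of the A effect: difference of raw group means."""
    y1 = mean(y for a, _, y in rows if a == 1)
    y0 = mean(y for a, _, y in rows if a == 0)
    return y1 - y0

def least_squares(rows):
    """Fit y ~ 1 + a + b by solving the normal equations (X'X) beta = X'y."""
    X = [(1.0, float(a), float(b)) for a, b, _ in rows]
    y = [obs for _, _, obs in rows]
    k = 3
    # Augmented matrix [X'X | X'y].
    M = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(k):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][k] for i in range(k)]  # [intercept, A effect, B effect]

balanced = make_data({(0, 0): 5, (0, 1): 5, (1, 0): 5, (1, 1): 5})
unbalanced = make_data({(0, 0): 8, (0, 1): 2, (1, 0): 2, (1, 1): 8})

balanced_diff = marginal_difference(balanced)      # 2.0, the true A effect
unbalanced_diff = marginal_difference(unbalanced)  # 8.0, confounded with B
fitted = least_squares(unbalanced)                 # A effect recovered (about 2)

print(balanced_diff, unbalanced_diff, fitted)
```

In the unbalanced design the A = 1 group is mostly B = 1 observations, so the raw mean difference picks up part of the B effect. The least-squares fit does not care about the imbalance, which is the sense in which computers made the arithmetic a non-issue.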

It's an old problem, and academics sank a lot of time into working on it. Meanwhile, many applied problems outpaced or obviated that research, but old habits die hard.

Edit to add:

I realize I didn't come out and just say it: there isn't a low-level problem with using unbalanced data. In my experience, the advice to "avoid unbalanced data" is either algorithm-specific or inherited wisdom. I agree with AdamO that, in general, unbalanced data poses no conceptual problem to a well-specified model.