Solved – Purpose of class balancing

classification, machine-learning, regression, unbalanced-classes

I see people doing class balancing (via oversampling, etc.) before training classifiers all the time. I wanted to know why class balancing improves classification accuracy. Is that always true? If so, is it true for all classification tasks? I tried to find a theory paper justifying this but couldn't find one.

Best Answer

What I am going to write is partly described in some of the posts mentioned in the comments. I think How to deal with a skewed class in binary classification having many features? comes closest to what I want to say.

I think that in general, class balancing and oversampling will not improve overall accuracy, but that is not the goal. As described in the cited post, with strong class imbalance you can get very high accuracy simply by labeling everything as the majority class. What I would like to emphasize is that the highest accuracy is not always the goal. It is often better to accept more false positives in return for eliminating some of the false negatives. Many diseases have a fairly low incidence rate, but simply declaring that no one has the disease is not an acceptable solution. If you identify most of the true positives (along with some false positives), additional testing can be applied to that smaller group to sort out which cases are real and which are not. The overall accuracy is lower, but you identify more of the cases that are critical to identify.
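To make that trade-off concrete, here is a minimal sketch on simulated data (hypothetical numbers; it uses class weighting in scikit-learn as a stand-in for explicit oversampling, which serves the same purpose here). The all-negative baseline has the highest accuracy but zero recall on the rare class, while the weighted model gives up a little accuracy to catch far more of the rare positives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Simulated screening data: roughly 2% positive ("diseased") cases.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline that ignores the minority class entirely: high accuracy, zero recall.
always_negative = np.zeros_like(y_te)
print("all-negative  acc=%.3f  recall=%.3f"
      % (accuracy_score(y_te, always_negative),
         recall_score(y_te, always_negative)))

# Unweighted vs. class-weighted logistic regression: the weighted model
# trades some accuracy for much better recall on the minority class.
for name, clf in [("unweighted ", LogisticRegression(max_iter=1000)),
                  ("balanced   ", LogisticRegression(max_iter=1000,
                                                     class_weight="balanced"))]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print("%s acc=%.3f  recall=%.3f"
          % (name, accuracy_score(y_te, pred), recall_score(y_te, pred)))
```

In the disease-screening setting described above, recall on the positive class is the number that matters: the cases the balanced model flags (including its false positives) form the small group that gets the follow-up testing.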