In machine learning, is it better to have class ratios balanced or representative of the population?

machine-learning, unbalanced-classes

In the context of machine learning, let's say you have a problem in which classes in the real population are not balanced, e.g. Class A occurs 80% of the time and Class B occurs 20% of the time.

In such a case, is it generally better to have a given ML algorithm rely on data with the same 80/20 class ratio, or data with a balanced (50/50) ratio?
a) with regard to training data
b) with regard to test data

A followup question: In case the answer for (a) or (b) happens to be going with the balanced 50/50 ratio, then does this preference generally still persist even in the practical context where the data one has access to happens to be of the 80/20 ratio? In other words, would the benefit of using a balanced ratio to train and/or test outweigh the cost of enforcing that ratio (e.g. by discarding instances from the majority class or generating new synthetic samples of the minority class)?

Best Answer

Check this paper for a good review of learning with imbalanced datasets.

One way of dealing with the problem is to do artificial subsampling or upsampling in the training set to balance the datasets.
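As a rough illustration of the resampling idea, here is a minimal sketch in plain Python. The helper names (`group_by_class`, `undersample`, `oversample`) and the toy 80/20 data are made up for this example; it only does random dropping and random duplication, not synthetic sample generation, and it is meant to be applied to the training set only.

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Map each class label to the list of samples carrying it."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups


def undersample(samples, labels, seed=0):
    """Randomly discard samples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    balanced = [(x, y) for y, xs in groups.items() for x in rng.sample(xs, target)]
    rng.shuffle(balanced)
    return balanced


def oversample(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    balanced = []
    for y, xs in groups.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set with the 80/20 ratio from the question.
train_x = list(range(100))
train_y = ["A"] * 80 + ["B"] * 20

down = undersample(train_x, train_y)
up = oversample(train_x, train_y)
down_counts = Counter(y for _, y in down)
up_counts = Counter(y for _, y in up)
print(down_counts, up_counts)
```

Undersampling throws away majority-class data, while oversampling by duplication keeps all the data but repeats minority samples, so which one is preferable depends on how much data you can afford to lose.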

I think it is usually better to have a balanced training set. Otherwise the decision boundary will give too much space to the larger class, and you will misclassify the smaller class too often. This is usually bad (think of cancer detection, where the smaller class, namely having a tumor, is the most costly one to miss).

If you don't want to use sampling methods, then you can use cost-based methods, where you weight the importance of every sample so that the loss function gets a larger contribution from the samples of the most important class. In cancer detection, you would give more weight to the cost coming from training samples of the positive class (having a tumor).
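To make the cost-based idea concrete, here is a small sketch of a class-weighted log loss in plain Python. The inverse-frequency weighting is one common heuristic and is an assumption of this example, not something prescribed above; the function names and the toy labels are also made up for illustration.

```python
import math


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {y: n_samples / (n_classes * c) for y, c in counts.items()}


def weighted_log_loss(labels, prob_positive, weights, positive=1):
    """Log loss where every sample is scaled by the weight of its true class."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, prob_positive):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        loss = -math.log(p) if y == positive else -math.log(1 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum


# Hypothetical data: 80 negatives (0) and 20 positives (1, e.g. "has a tumor").
labels = [0] * 80 + [1] * 20
# A lazy model that outputs the base rate of the positive class for everyone.
probs = [0.2] * 100

weights = inverse_frequency_weights(labels)
uniform = {0: 1.0, 1: 1.0}
plain_loss = weighted_log_loss(labels, probs, uniform)
balanced_loss = weighted_log_loss(labels, probs, weights)
print(weights, plain_loss, balanced_loss)
```

With these weights both classes contribute the same total weight to the loss, so the lazy model that looks acceptable under the unweighted loss is penalized more heavily for its errors on the minority class.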

Finally, remember that if the test set is very unbalanced, classification accuracy is not a good measure of performance. With an 80/20 split, for example, a classifier that always predicts the majority class reaches 80% accuracy while never detecting the minority class. You would be better off using precision/recall and the F-score, which are easily computed from the confusion matrix. Check this paper for references on classification performance measures for many different scenarios.
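As a quick sketch of how these measures fall out of the confusion matrix, the snippet below computes them for a binary problem in plain Python. The function names and the toy predictions are invented for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical 80/20 test set: 80 negatives followed by 20 positives.
y_true = [0] * 80 + [1] * 20
# The classifier flags 2 negatives by mistake and finds only 5 of the 20 positives.
y_pred = [0] * 78 + [1] * 2 + [1] * 5 + [0] * 15

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)
```

Here the accuracy looks respectable, while the recall shows that most of the minority class is being missed, which is exactly the situation accuracy hides on unbalanced test sets.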

Another good read on the topic is this one.
