Supervised Machine Learning – Addressing Class Imbalance

machine-learning, supervised-learning, unbalanced-classes

This is a general question, not specific to any method or data set. How do we deal with a class imbalance problem in supervised machine learning where, say, 90% of the labels in the dataset are 0 and 10% are 1? How do we optimally train the classifier?

One approach I follow is to sample the data to make the dataset balanced, train the classifier, and repeat this for multiple samples.

This feels ad hoc. Is there a framework for approaching these kinds of problems?

Best Answer

There are many frameworks and approaches. This is a recurrent issue.

Examples (a minimal code sketch of each follows the list):

  • Undersampling. Select a subsample of the set of zeros such that its size matches the set of ones. There is an obvious loss of information, unless you use a more complex framework (for instance, I would split the set of zeros into 9 smaller, mutually exclusive subsets, train a model on each one of them, and ensemble the models).
  • Oversampling. Produce artificial ones until the proportion is 50%/50%. My previous employer used this by default. There are many frameworks for this (I think SMOTE is the most popular, but I prefer simpler tricks like Noisy PCA).
  • One-Class Learning. Just assume your data has a few real points (the ones) and that everything else is random noise that doesn't physically exist but leaked into the dataset (anything that is not a one is noise). Use an algorithm to denoise the data instead of a classification algorithm.
  • Cost-Sensitive Training. Use an asymmetric cost function to artificially balance the training process.

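As an illustration of the undersampling-ensemble idea, here is a minimal sketch using scikit-learn. The 9-way split, the logistic regression base model, and the function names are my choices for illustration, not a prescribed recipe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def undersampling_ensemble(X, y, n_splits=9, seed=0):
    """Train one model per majority-class subset, each paired with all minority points."""
    rng = np.random.default_rng(seed)
    zeros = np.where(y == 0)[0]
    ones = np.where(y == 1)[0]
    rng.shuffle(zeros)
    models = []
    # Split the zeros into n_splits mutually exclusive subsets.
    for subset in np.array_split(zeros, n_splits):
        idx = np.concatenate([subset, ones])
        model = LogisticRegression(max_iter=1000)
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def ensemble_predict_proba(models, X):
    # Average the predicted probability of class 1 across the individual models.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```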
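For oversampling, the imbalanced-learn package provides SMOTE out of the box. A sketch, assuming a feature matrix X and a label vector y:

```python
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority points by interpolating between
# a minority point and one of its nearest minority-class neighbours.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
# y_resampled now contains as many ones as zeros; train any classifier on it.
```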
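For the one-class view, scikit-learn's OneClassSVM can be fit on the "real" points alone. Treating the ones as the real class and the nu value below are assumptions for illustration; X_new stands for whatever data you want to score:

```python
from sklearn.svm import OneClassSVM

# Fit a one-class model on the ones only; anything the model
# rejects is treated as noise (i.e., a zero).
ocsvm = OneClassSVM(kernel="rbf", nu=0.1)  # nu: assumed outlier fraction among the ones
ocsvm.fit(X[y == 1])

# predict() returns +1 for inliers (ones) and -1 for outliers (zeros).
y_pred = (ocsvm.predict(X_new) == 1).astype(int)
```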
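Cost-sensitive training is often a one-liner: many scikit-learn estimators accept a class_weight argument that reweights the loss. In this sketch the 1:9 weighting mirrors the 90%/10% split in the question:

```python
from sklearn.linear_model import LogisticRegression

# Misclassifying a one costs 9 times as much as misclassifying a zero,
# counteracting the 90%/10% imbalance (class_weight="balanced" computes
# such weights automatically from the label frequencies).
clf = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)
clf.fit(X, y)
```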

Oh, and by the way, 90%/10% is not unbalanced. Card transaction fraud datasets are often split 99.97%/0.03%. That is unbalanced.