Machine Learning – Best Way to Handle Unbalanced Multiclass Dataset with SVM

machine-learning, predictive-models, svm, unbalanced-classes

I'm trying to build a prediction model with SVMs on fairly unbalanced data. My labels/output have three classes: positive, neutral, and negative. I would say positive examples make up about 10–20% of my data, neutral about 50–60%, and negative about 30–40%. I'm trying to balance out the classes because the costs associated with incorrect predictions are not the same across classes. One method was resampling the training data to produce an equally balanced dataset, which was larger than the original. Interestingly, when I do that, I tend to get better predictions for the other class (e.g., when I balanced the data by increasing the number of examples for the positive class, out-of-sample predictions improved for the negative class). Can anyone explain generally why this occurs? If I increase the number of examples for the negative class, would I see something similar for the positive class in out-of-sample predictions (i.e., better predictions)?
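For concreteness, here is roughly what I mean by resampling: oversample each minority class with replacement until every class matches the largest one. This is a minimal sketch using scikit-learn's `resample`; the data and class proportions are just illustrative.

```python
# Oversample minority classes until each class matches the largest one.
# The toy data below mimics the rough 15% / 55% / 30% split described above.
import numpy as np
from sklearn.utils import resample

def oversample_to_balance(X, y, random_state=0):
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        Xc, yc = X[y == c], y[y == c]
        # Sample with replacement up to the size of the majority class.
        Xr, yr = resample(Xc, yc, replace=True, n_samples=n_max,
                          random_state=random_state)
        X_parts.append(Xr)
        y_parts.append(yr)
    return np.vstack(X_parts), np.concatenate(y_parts)

X = np.random.RandomState(0).randn(100, 2)
y = np.array([0] * 15 + [1] * 55 + [2] * 30)  # imbalanced labels
Xb, yb = oversample_to_balance(X, y)
print(np.bincount(yb))  # each class now has 55 examples
```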

I'm also very open to other thoughts on how to address the unbalanced data, either by imposing different misclassification costs or by using the class weights in LibSVM (though I'm not sure how to select/tune those properly).

Best Answer

Using different penalties for the margin slack variables for patterns of each class is a better approach than resampling the data. It is asymptotically equivalent to resampling anyway, but it is easier to implement and continuous rather than discrete, so you have more control.
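In scikit-learn's `SVC` (which wraps LibSVM), these per-class slack penalties are exposed as `class_weight`: the penalty for class k becomes `C * weight_k`, so minority or high-cost classes can be penalised more heavily for margin violations. A minimal sketch, with weights that are purely illustrative rather than tuned:

```python
# Per-class slack penalties via class_weight in scikit-learn's SVC.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical imbalanced 3-class data, roughly 15% / 55% / 30%.
X, y = make_classification(
    n_samples=600, n_classes=3, n_informative=6,
    weights=[0.15, 0.55, 0.30], random_state=0,
)

# class_weight scales C per class: C_k = C * weight_k, so the rare
# positive class (label 0 here) gets the largest penalty. The values
# below are placeholders, not tuned choices.
clf = SVC(C=1.0, class_weight={0: 4.0, 1: 1.0, 2: 2.0})
clf.fit(X, y)
print(clf.score(X, y))
```

Passing `class_weight="balanced"` instead uses weights inversely proportional to class frequencies, which is a reasonable default starting point before tuning.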

However, choosing the weights is not straightforward. In principle you can work out a theoretical weighting that takes into account the misclassification costs and the differences between the training-set and operational prior class probabilities, but it will not give optimal performance. The best approach is to select the penalties/weights for each class by minimising the loss (taking the misclassification costs into account) via cross-validation.
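One way to sketch this: wrap a cost-sensitive loss in a scorer and grid-search over candidate `class_weight` settings with `GridSearchCV`. The cost matrix and the candidate weight grids below are hypothetical placeholders; in practice you would encode your actual misclassification costs and search a finer grid.

```python
# Select per-class weights by cross-validation against a cost-sensitive loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=600, n_classes=3, n_informative=6,
    weights=[0.15, 0.55, 0.30], random_state=0,
)

# COST[i, j]: cost of predicting class j when the true class is i.
# Zeros on the diagonal; off-diagonal values are placeholders.
COST = np.array([[0, 1, 4],
                 [1, 0, 1],
                 [4, 1, 0]])

def neg_expected_cost(y_true, y_pred):
    # Average misclassification cost, negated so that higher is better.
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    return -(cm * COST).sum() / cm.sum()

grid = GridSearchCV(
    SVC(C=1.0),
    param_grid={"class_weight": [
        {0: 1, 1: 1, 2: 1},   # no reweighting
        {0: 4, 1: 1, 2: 2},   # upweight rare/expensive classes
        "balanced",           # inverse-frequency weights
    ]},
    scoring=make_scorer(neg_expected_cost),
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same idea works with raw LibSVM via its `-wi` per-class weight options; only the tuning loop would have to be written by hand.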