Machine Learning – Best Way to Handle Unbalanced Multiclass Dataset with SVM

machine-learning, predictive-models, svm, unbalanced-classes

I'm trying to build a prediction model with SVMs on fairly unbalanced data. My labels/output have three classes: positive, neutral, and negative. I would say positive examples make up about 10–20% of my data, neutral about 50–60%, and negative about 30–40%. I'm trying to balance out the classes because the costs associated with incorrect predictions are not the same across classes. One method was resampling the training data to produce an equally balanced dataset, which was larger than the original. Interestingly, when I do that, I tend to get better predictions for the other class (e.g., when I balanced the data by increasing the number of examples for the positive class, out-of-sample predictions improved for the negative class). Can anyone explain generally why this occurs? If I increase the number of examples for the negative class, would I see something similar for the positive class in out-of-sample predictions (i.e., better predictions)?
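For concreteness, here is roughly what I mean by resampling: oversample each minority class with replacement until every class matches the largest one. This is a minimal sketch using scikit-learn's `resample`; the data and class proportions are just illustrative.

```python
# Oversample minority classes until each class matches the largest one.
# The toy data below mimics the rough 15% / 55% / 30% split described above.
import numpy as np
from sklearn.utils import resample

def oversample_to_balance(X, y, random_state=0):
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        Xc, yc = X[y == c], y[y == c]
        # Sample with replacement up to the size of the majority class.
        Xr, yr = resample(Xc, yc, replace=True, n_samples=n_max,
                          random_state=random_state)
        X_parts.append(Xr)
        y_parts.append(yr)
    return np.vstack(X_parts), np.concatenate(y_parts)

X = np.random.RandomState(0).randn(100, 2)
y = np.array([0] * 15 + [1] * 55 + [2] * 30)  # imbalanced labels
Xb, yb = oversample_to_balance(X, y)
print(np.bincount(yb))  # each class now has 55 examples
```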

I'm also very open to other thoughts on how to address the unbalanced data, either by imposing different misclassification costs or by using the class weights in LibSVM (though I'm not sure how to select/tune those properly).

Best Answer

Using different penalties for the margin slack variables for patterns of each class is a better approach than resampling the data. It is asymptotically equivalent to resampling anyway, but it is easier to implement and continuous rather than discrete, so you have more control.
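In scikit-learn's `SVC` (which wraps LibSVM), these per-class slack penalties are exposed as `class_weight`: the penalty for class k becomes `C * weight_k`, so minority or high-cost classes can be penalised more heavily for margin violations. A minimal sketch, with weights that are purely illustrative rather than tuned:

```python
# Per-class slack penalties via class_weight in scikit-learn's SVC.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical imbalanced 3-class data, roughly 15% / 55% / 30%.
X, y = make_classification(
    n_samples=600, n_classes=3, n_informative=6,
    weights=[0.15, 0.55, 0.30], random_state=0,
)

# class_weight scales C per class: C_k = C * weight_k, so the rare
# positive class (label 0 here) gets the largest penalty. The values
# below are placeholders, not tuned choices.
clf = SVC(C=1.0, class_weight={0: 4.0, 1: 1.0, 2: 2.0})
clf.fit(X, y)
print(clf.score(X, y))
```

Passing `class_weight="balanced"` instead uses weights inversely proportional to class frequencies, which is a reasonable default starting point before tuning.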

However, choosing the weights is not straightforward. In principle you can work out a theoretical weighting that takes into account the misclassification costs and the differences between the training-set and operational prior class probabilities, but it will not give optimal performance. The best approach is to select the penalties/weights for each class by minimising the loss (taking the misclassification costs into account) via cross-validation.
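One way to sketch this: wrap a cost-sensitive loss in a scorer and grid-search over candidate `class_weight` settings with `GridSearchCV`. The cost matrix and the candidate weight grids below are hypothetical placeholders; in practice you would encode your actual misclassification costs and search a finer grid.

```python
# Select per-class weights by cross-validation against a cost-sensitive loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=600, n_classes=3, n_informative=6,
    weights=[0.15, 0.55, 0.30], random_state=0,
)

# COST[i, j]: cost of predicting class j when the true class is i.
# Zeros on the diagonal; off-diagonal values are placeholders.
COST = np.array([[0, 1, 4],
                 [1, 0, 1],
                 [4, 1, 0]])

def neg_expected_cost(y_true, y_pred):
    # Average misclassification cost, negated so that higher is better.
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    return -(cm * COST).sum() / cm.sum()

grid = GridSearchCV(
    SVC(C=1.0),
    param_grid={"class_weight": [
        {0: 1, 1: 1, 2: 1},   # no reweighting
        {0: 4, 1: 1, 2: 2},   # upweight rare/expensive classes
        "balanced",           # inverse-frequency weights
    ]},
    scoring=make_scorer(neg_expected_cost),
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same idea works with raw LibSVM via its `-wi` per-class weight options; only the tuning loop would have to be written by hand.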