I want to try Support Vector Machines (SVMs) on my dataset. Before I attempt the problem, though, I was warned that SVMs don't perform well on extremely unbalanced data. In my case, the data can be as much as 95-98% 0s and 2-5% 1s.
I tried to find resources that discuss using SVMs on sparse/unbalanced data, but all I could find was 'sparseSVMs' (which use a small number of support vectors).
I was hoping someone could briefly explain:
- How well SVM would be expected to do with such a dataset
- Which, if any, modifications must be made to the SVM algorithm
- What resources/papers discuss this
Best Answer
Many SVM implementations address this by assigning different weights to positive and negative instances. Essentially, you weight the samples so that the sum of the weights for the positives equals that of the negatives. Of course, when evaluating the SVM you have to remember that if 95% of the data is negative, it is trivial to get 95% accuracy by always predicting negative. So make sure your evaluation metrics are also weighted so that both classes count equally.
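To make the weighting and the evaluation pitfall concrete, here is a minimal sketch in plain Python. The function names and the 95/5 toy labels are made up for illustration; the weight formula (n_samples / (n_classes * class_count)) is one common heuristic for equalizing the total weight per class, not the only option.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights chosen so every class has the same total weight."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rarer classes get proportionally larger weights.
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless classifier that always predicts the majority class.
y_pred = [0] * 100

weights = balanced_class_weights(y_true)
print(weights)                            # weight for class 1 is 10.0
print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-negative classifier looks good on plain accuracy but scores no better than chance once each class counts equally. The resulting per-class weights are the kind of values you would pass to an SVM implementation's class-weight option.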
Specifically in libsvm, which you added as a tag, there is a flag that allows you to set the class weights (-wi, where i is the class label, I believe, but check the docs).
Finally, from personal experience I can tell you that I often find that an SVM will yield very similar results with or without the weight correction.