Solved – Efficient way to classify with SVM

machine learning, svm

I'm doing binary classification with an SVM classifier (libsvm), where roughly 95% of the data belongs to one class.

The parameters C and gamma must be set before training. I followed the tutorial but still can't get any good results.

The library comes with a script that is supposed to help choose good parameter values, but all it does is maximize the accuracy metric (TP+TN)/ALL. With my class balance, a model that labels everything with the prevailing class already scores about 95% accuracy, so the script picks parameters that do exactly that.

I would instead like to choose parameters using recall- and precision-based metrics; accuracy is a meaningless metric for what I'm doing. How can I approach this? I'm also open to swapping libsvm for any other library that can help, as long as it accepts data in the same format:

1 1:0.3 2:0.4 …
-1 1:0.4 2:0.23 and so on
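For reference, scikit-learn wraps libsvm, reads this same sparse format, and lets a parameter search be scored by F1 (or precision/recall) rather than accuracy. A minimal sketch of that idea; the file name train.libsvm and the grid ranges are placeholders, not anything from the original setup:

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# "train.libsvm" is a placeholder; load_svmlight_file reads the same
# sparse "label index:value" format that libsvm and SVMlight use.
X, y = load_svmlight_file("train.libsvm")

# Coarse log2 grids, similar in spirit to grid.py's defaults.
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}

# scoring="f1" makes the search maximize the F1 score of the positive
# class (label 1) instead of raw accuracy, so labeling everything with
# the majority class no longer wins; n_jobs=-1 runs the grid in parallel.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)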

Can anybody help?

UPDATE:
Yes, I did try both grid.py and easy.py, but even though the grid search uses a logarithmic scale it is extremely slow. Even if I run it on just a small chunk of my data it takes tens of hours to finish. Is this the most efficient way to use an SVM? I have also tried svmlight, but it does exactly the same thing: it labels all the data with one label.

UPDATE2:
I reworded my question to better reflect the issues I am facing.

Best Answer

I would do two things. First, to address your accuracy issue with the imbalanced data, you need to set the cost of misclassifying positive and negative examples separately. A reasonable rule of thumb in your case is a cost of 5 for the larger class and 95 for the smaller class: misclassifying 10% of the smaller class then costs the same as misclassifying 10% of the larger class, even though the latter 10% is a much larger number of points. On the libsvm command line this is done with the -wi options; with your 1/-1 labels it would be something like -w1 5 -w-1 95 (match each weight to whichever label marks the larger or smaller class). I feel this needs to be done anyway, even though you plan to select parameters by F score, because the weighted cost is what the SVM actually optimizes; unless you set it, all your F scores will be poor.
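In scikit-learn the same weighting is expressed with class_weight. A sketch of that, assuming the majority class is labeled 1 and the minority class -1 (swap the weights if yours are the other way around); the gamma value is a placeholder:

from sklearn.svm import SVC

# Per-class misclassification costs, following the 5/95 rule of thumb above.
clf = SVC(
    kernel="rbf",
    C=1.0,
    gamma=0.5,                    # placeholder: use whatever the parameter search selects
    class_weight={1: 5, -1: 95},  # effective C of class i is class_weight[i] * C
)

The class_weight setting can also go straight into the grid-search sketch above, so the weighted objective is the one being tuned.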

Second, to address the speed issue, I would try precomputing the kernel. For your 26k points this is borderline infeasible, but if you are willing to subsample, you can precompute the kernel once per gamma and reuse it across all values of C.
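A minimal sketch of that reuse, again with scikit-learn (the function name and the 5/95 weights are my own; scikit-learn's cross-validation routines know how to slice a precomputed Gram matrix when the estimator uses kernel="precomputed"):

from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_with_precomputed_kernel(X, y, gammas, Cs):
    """Score every (gamma, C) pair by mean F1, computing each Gram matrix once."""
    results = {}
    for gamma in gammas:
        K = rbf_kernel(X, gamma=gamma)  # computed once per gamma, reused for every C
        for C in Cs:
            clf = SVC(kernel="precomputed", C=C, class_weight={1: 5, -1: 95})
            scores = cross_val_score(clf, K, y, scoring="f1", cv=5)
            results[(gamma, C)] = scores.mean()
    best = max(results, key=results.get)
    return best, results

If you do subsample, draw a stratified sample so the 95/5 ratio is preserved; otherwise the minority class can disappear from training folds entirely.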