Solved – libsvm training very slow on 100K rows, suggestions

svm

I'm trying to run the libsvm-provided wrapper script easy.py on a training set of 100K rows, where each row has ~300 features. The feature data is relatively sparse: only about 1/10th of the values are non-zero.

The script is excruciatingly slow; I'm talking days (or more). I ran the same script on 1% of the data and it finished in about 20 minutes with some reasonable-looking results, so the input data and format appear to be correct and there are no obvious issues with them.
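In case it matters, this is a minimal sketch of one way to pull a small subsample out of a libsvm-format file for that kind of timing test (file names and the sampling fraction are illustrative, not the exact commands I used):

    import random

    def subsample_libsvm(src_path, dst_path, fraction=0.01, seed=0):
        """Copy roughly `fraction` of the lines of a libsvm-format file."""
        rng = random.Random(seed)
        with open(src_path) as src, open(dst_path, "w") as dst:
            for line in src:
                if rng.random() < fraction:
                    dst.write(line)

    # keep ~1% of the rows for a quick timing run
    subsample_libsvm("train.libsvm", "train_1pct.libsvm", fraction=0.01)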

I found the documentation for libsvm to be somewhat lacking and not very helpful on practical issues like performance. Their FAQ is silent on these matters:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html

Has anyone experienced similar issues with SVM training speed? Do you know of more suitable libraries or specific strategies to try out in such cases?

Best Answer

I've seen liblinear runtimes that are very sensitive to tol; try tol=0.1, and if possible a linear kernel rather than RBF. How many classes do you have? How much memory do you have? Monitor real / virtual memory with "top" or the like.
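To make that concrete, here is a minimal sketch of the linear route using scikit-learn's LinearSVC, which wraps liblinear; the file name and the C / tol values are just illustrative:

    from sklearn.datasets import load_svmlight_file
    from sklearn.svm import LinearSVC

    # Load the libsvm-format training file into a scipy sparse matrix.
    X, y = load_svmlight_file("train.libsvm")

    # Linear kernel via liblinear, with a loose tolerance so the solver
    # stops early instead of grinding toward a very precise optimum.
    clf = LinearSVC(C=1.0, tol=0.1)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))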

Stochastic gradient descent (SGDClassifier in scikits.learn) is fast. For example, on MNIST handwritten digit data, 10k rows x 784 features, with 80% of the raw data zero, standardized with -= mean and /= std:

 12 sec  sgd        mnist28 (10000, 784)  tol 0.1  C 1  penalty l2  correct 89.6 %
321 sec  LinearSVC  mnist28 (10000, 784)  tol 0.1  C 1  penalty l2  correct 86.6 %

This is with no tuning or cross-validation; your mileage will vary.
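A rough sketch of the SGD route with a current scikit-learn API (the loader, standardization step, and hyperparameters are illustrative, not the exact setup behind the timings above):

    from sklearn.datasets import load_svmlight_file
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = load_svmlight_file("train.libsvm")

    # "-= mean and /= std": densify and standardize each feature column.
    # (For very sparse data, StandardScaler(with_mean=False) on the sparse
    # matrix avoids densifying.)
    X = StandardScaler().fit_transform(X.toarray())

    # Hinge loss + L2 penalty gives a linear SVM trained by SGD; the cost
    # grows roughly linearly with the number of rows, so 100K is no problem.
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, max_iter=20)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))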

Added: see also Sofia-ml -- comments, anyone?

And please post what worked / what didn't.