SVM vs Decision Tree – Why SVM May Underperform on Same Data

classification, machine-learning, scikit-learn, svm

I am new to machine learning and am trying to use scikit-learn (sklearn) on a classification problem. Both a decision tree and an SVM can train a classifier for this problem.

I used sklearn.ensemble.RandomForestClassifier and sklearn.svm.SVC to fit the same training data (about 500,000 entries with 50 features per entry). The RandomForestClassifier produces a classifier in about one minute. The SVC has been running for more than 24 hours and still has not finished.

Why does the SVC perform so inefficiently? Is the data set too big for SVC? Is SVC unsuitable for such a problem?
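For context on the timing gap: kernel SVC training scales roughly between quadratically and cubically in the number of samples, while random forests and linear SVM solvers scale roughly linearly, so 500,000 rows is far outside SVC's comfort zone. A minimal sketch of the comparison on synthetic stand-in data (same 50 features, far fewer rows than the actual set):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for the real data: 50 features, but only 5,000 rows.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

timings = {}
for name, clf in [
    ("RandomForestClassifier", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("LinearSVC", LinearSVC(random_state=0)),  # linear SVM, roughly linear-time solver
    ("SVC (RBF kernel)", SVC(kernel="rbf")),   # kernel SVM, ~O(n^2)..O(n^3) training
]:
    start = time.time()
    clf.fit(X, y)
    timings[name] = time.time() - start
    print(f"{name}: {timings[name]:.2f}s")
```

Even at this small size the gap is usually visible; at 500,000 rows it becomes the difference between minutes and days. If a linear decision boundary is acceptable, LinearSVC (or sklearn.linear_model.SGDClassifier) is the usual substitute at that scale.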

Best Answer

Possibilities include the use of an inappropriate kernel (e.g. a linear kernel for a non-linear problem) and a poor choice of kernel and regularisation hyper-parameters. Good model selection (choice of kernel and hyper-parameter tuning) is the key to getting good performance from SVMs; they can only be expected to give good results when used correctly.
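A minimal model-selection sketch along these lines, grid-searching the kernel and hyper-parameters with cross-validation (the grid values and synthetic data here are illustrative, not tuned for any particular problem):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative synthetic data, not the asker's actual set.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Scaling matters for SVMs: unscaled features distort the kernel.
pipe = make_pipeline(StandardScaler(), SVC())

param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],         # regularisation strength
    "svc__gamma": ["scale", 0.01],  # RBF kernel width (ignored by the linear kernel)
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)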

SVMs often do take a long time to train; this is especially true when the choice of kernel, and particularly of the regularisation parameter, means that almost all of the data end up as support vectors (the sparsity of SVMs is a handy by-product, nothing more).
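One way to see this in sklearn is to inspect the fitted model's support_ attribute. With a heavily regularised (very small) C, almost every training point typically becomes a support vector, which slows both training and prediction (the data and C values below are chosen purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative synthetic data.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

for C in (0.001, 1.0):
    svc = SVC(kernel="rbf", C=C).fit(X, y)
    frac = len(svc.support_) / len(X)  # fraction of training points kept as support vectors
    print(f"C={C}: {frac:.0%} of training points are support vectors")
```

A model whose support vector count approaches the training set size has lost the sparsity that makes SVM prediction cheap, and is usually a sign the hyper-parameters need revisiting.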

Lastly, the no free lunch theorems say that no classifier system is a priori superior to the others, so the best classifier for a particular task is itself task-dependent. However, there is compelling theory for the SVM that suggests it is likely to be a better choice than many other approaches for many problems.