How is SVM optimization implemented in packages like Scikit-Learn? Clearly, SVM is a quadratic programming problem, but why not just use gradient descent to update the parameters? Is it because we want to find the values of $\alpha_i$ and identify the support vectors? I found this link, but it doesn't seem to address my question.
SVM Gradient Descent – Why Not Solve SVM with Gradient Descent Instead of Quadratic Programming?
Tags: gradient descent, kernel trick, machine learning, optimization, svm
Related Solutions
Set $\mathbf w = \phi(\mathbf x)\,\mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(\mathbf x)$ is a mapping of the original input matrix $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:
$$J(\mathbf{u}, b) = C \sum\limits_{i=1}^{m} \max\left(0,\, 1 - y^{(i)} \left(\mathbf{u}^t \mathbf{K}^{(i)} + b\right)\right) + \dfrac{1}{2}\, \mathbf{u}^t \mathbf{K}\, \mathbf{u}$$
$\mathbf{K}$ is an $m \times m$ matrix and $\mathbf{u}$ is an $m \times 1$ vector. Neither is infinite-dimensional, even when the feature map $\phi$ is.
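For concreteness, here is a minimal numpy sketch of evaluating this kernelized primal objective; the function name and argument layout are illustrative, with $\mathbf K$ assumed to be precomputed:

```python
import numpy as np

# Sketch: evaluate the kernelized primal objective J(u, b) above.
# K: precomputed m x m kernel matrix; u: (m,) coefficients; y: labels in {-1, +1}.
def primal_objective(K, u, b, y, C):
    margins = y * (K @ u + b)                 # y^(i) (u^t K^(i) + b) for each sample
    hinge = np.maximum(0.0, 1.0 - margins)    # soft-margin hinge terms
    return C * hinge.sum() + 0.5 * u @ K @ u  # hinge loss plus the u^t K u penalty
```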
Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as meaningful approximate solutions: an approximate primal solution is still an approximate solution to the original problem, a guarantee the dual formulation does not offer.
As for why the dual is so much more prominent, the reason is not obvious at all: [1]
The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]
Chapelle (2007) argues that the time complexity of both primal and dual optimization is $\mathcal{O}\left(n n_{sv} + n_{sv}^3\right)$, with a worst case of $\mathcal{O}\left(n^3\right)$. Note, however, that the analysis uses quadratic and smoothed approximations of the hinge loss rather than the hinge loss itself, since the exact hinge loss is not differentiable and therefore cannot be used with Newton's method.
Yes, whether feature scaling matters depends on the kernel, and in general it is a good idea. The kernel is effectively a distance, and if different features vary on different scales then this can matter. For the RBF kernel, for instance, we have $$ K(x, x') = \exp\left(-\gamma ||x-x'||^2\right) $$ so if one dimension takes much larger values than the others, it will dominate the kernel values and you'll lose signal in the other dimensions. This applies to the linear kernel too.
But this doesn't apply to all kernels, since some have scaling built in. For example, you could use something like the ARD or Mahalanobis kernel $$ K(x, x') = \exp\left(-\gamma (x-x')^T\hat \Sigma^{-1}(x-x')\right) $$ where $\hat \Sigma$ is the sample covariance matrix, or perhaps just the diagonal matrix of individual variances. As a function of $x$ and $x'$ this is still positive definite, so it is a valid kernel.
As a general strategy for deciding if this is an issue for any particular kernel, just do what they did in the linked question and try it with data like $x=(1000,1,2,3)$, $x'=(500, .5, 3, 2)$ and see if the first dimension necessarily dominates.
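As an illustration of that probe (plain numpy, made-up helper names and scales), the plain RBF kernel underflows to zero on those points because the first coordinate dominates, while a variance-scaled kernel along the lines of the previous paragraph recovers signal from the remaining dimensions:

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    # Plain RBF kernel.
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def rbf_ard(x, xp, inv_sigma, gamma=1.0):
    # Mahalanobis/ARD-style RBF kernel with a per-dimension scaling matrix.
    d = x - xp
    return np.exp(-gamma * d @ inv_sigma @ d)

x  = np.array([1000.0, 1.0, 2.0, 3.0])
xp = np.array([500.0, 0.5, 3.0, 2.0])

# ||x - x'||^2 = 500^2 + 0.25 + 1 + 1: the first coordinate swamps the rest,
# and the kernel value underflows to 0.
print(rbf(x, xp))

# With illustrative per-dimension variances, each squared difference is put on
# a comparable scale and the kernel value becomes informative again.
inv_sigma = np.diag(1.0 / np.array([250000.0, 0.25, 1.0, 1.0]))
print(rbf_ard(x, xp, inv_sigma))  # exp(-4) ~ 0.018
```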
Another way to try to assess a given kernel is to try to see if it inherits scale issues from subfunctions. For example, consider the polynomial kernel $K_{poly}(x,x') = (a+cx^Tx')^d$. We can write this as a function of the linear kernel $x^Tx'$, which we already know to be sensitive to scale, and the map $z \mapsto (a+cz)^d$ won't undo scale issues, so we can see that the polynomial kernel inherits these issues. We can do a similar analysis by writing the RBF kernel as a function of the scale-sensitive $||x-x'||^2$.
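Reusing $x$ and $x'$ from the sketch above, the same probe applied to the linear and polynomial kernels (with illustrative values for $a$, $c$, and $d$) makes the inheritance visible:

```python
# The cross term from the first dimension (1000 * 500) swamps the other
# dimensions' contribution (0.5 + 6 + 6), and the map z -> (a + c z)^d
# only amplifies the imbalance.
lin = x @ xp                     # 500012.5, dominated by the first coordinate
poly = (1.0 + 0.01 * lin) ** 2   # illustrative a = 1, c = 0.01, d = 2
print(lin, poly)
```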
Best Answer
Scikit-Learn's SVM is based on LIBSVM (see the documentation of SVC), whose implementation document you can find here; the specific solver is described in Section 4.
Generally, gradient descent is a first-order method: it uses minimal information about the problem (only gradients), so it tends to converge slowly and can run into convergence issues. Taking more of the problem's structure into account, such as its quadratic objective and linear constraints, yields faster and more robust algorithms. There are also quadratic optimization algorithms developed specifically for the SVM, notably sequential minimal optimization (SMO).
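To see what the first-order approach looks like, here is a minimal sketch (not any library's implementation) of subgradient descent on the primal soft-margin objective for a linear SVM; it only ever touches gradients, ignoring the quadratic structure that SMO exploits:

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=1e-3, epochs=100):
    """Sketch: subgradient descent on (1/2)||w||^2 + C * sum of hinge losses."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # samples violating the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```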
Also note that Scikit-Learn has SGDClassifier, which is exactly what you propose: the SVM (hinge loss) formulation optimized with stochastic gradient descent. You may find this question comparing the two interesting.
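As a rough usage sketch with real scikit-learn APIs (toy data, near-default hyperparameters), the two fit the same kind of linear hinge-loss model with very different optimizers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Dual QP (SMO, via LIBSVM) on the hinge loss with a linear kernel.
svc = SVC(kernel="linear", C=1.0).fit(X, y)

# Primal stochastic (sub)gradient descent on the same hinge loss.
sgd = SGDClassifier(loss="hinge", max_iter=1000, random_state=0).fit(X, y)

print(svc.score(X, y), sgd.score(X, y))
```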