How is SVM optimization implemented in packages like Scikit-Learn? Clearly, SVM is a quadratic programming problem, but why not just use gradient descent to update the parameters? Is it because we want to find the values of $\alpha_i$ and identify the support vectors? I found this link, but it doesn't seem to address my question.
SVM Gradient Descent – Why Not Solve SVM with Gradient Descent Instead of Quadratic Programming?
Tags: gradient descent, kernel trick, machine learning, optimization, svm
Related Solutions
Set $\mathbf w = \phi(\mathbf x)\,\mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(\mathbf x)$ is a mapping of the original input matrix $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:
$$J(\mathbf{u}, b) = C \sum\limits_{i=1}^{m} \max\left(0,\, 1 - y^{(i)} \left(\mathbf{u}^t \mathbf{K}^{(i)} + b\right)\right) + \dfrac{1}{2}\, \mathbf{u}^t \mathbf{K}\, \mathbf{u}$$
$\mathbf{K}$ is an $m \times m$ matrix and $\mathbf{u}$ is an $m \times 1$ vector. Neither is infinite-dimensional, even when the feature map $\phi$ is.
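For concreteness, here is a minimal numpy sketch of evaluating this kernelized primal objective; the function name and argument layout are illustrative, with $\mathbf K$ assumed to be precomputed:

```python
import numpy as np

# Sketch: evaluate the kernelized primal objective J(u, b) above.
# K: precomputed m x m kernel matrix; u: (m,) coefficients; y: labels in {-1, +1}.
def primal_objective(K, u, b, y, C):
    margins = y * (K @ u + b)                 # y^(i) (u^t K^(i) + b) for each sample
    hinge = np.maximum(0.0, 1.0 - margins)    # soft-margin hinge terms
    return C * hinge.sum() + 0.5 * u @ K @ u  # hinge loss plus the u^t K u penalty
```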
Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as meaningful approximate solutions: an approximate primal solution is still an approximate solution to the original problem, a guarantee the dual formulation does not offer.
As for why the dual is so much more prominent, the reason is not obvious at all: [1]
The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]
Chapelle (2007) argues that the time complexity of both primal and dual optimization is $\mathcal{O}\left(n n_{sv} + n_{sv}^3\right)$, with a worst case of $\mathcal{O}\left(n^3\right)$. Note, however, that the analysis uses quadratic and smoothed approximations of the hinge loss rather than the hinge loss itself, since the exact hinge loss is not differentiable and therefore cannot be used with Newton's method.
Yes, whether feature scaling matters depends on the kernel, and in general it is a good idea. The kernel is effectively a distance, and if different features vary on different scales then this can matter. For the RBF kernel, for instance, we have $$ K(x, x') = \exp\left(-\gamma ||x-x'||^2\right) $$ so if one dimension takes much larger values than the others, it will dominate the kernel values and you'll lose signal in the other dimensions. This applies to the linear kernel too.
But this doesn't apply to all kernels, since some have scaling built in. For example, you could use something like the ARD or Mahalanobis kernel $$ K(x, x') = \exp\left(-\gamma (x-x')^T\hat \Sigma^{-1}(x-x')\right) $$ where $\hat \Sigma$ is the sample covariance matrix, or perhaps just the diagonal matrix of individual variances. As a function of $x$ and $x'$ this is still positive definite, so it is a valid kernel.
As a general strategy for deciding if this is an issue for any particular kernel, just do what they did in the linked question and try it with data like $x=(1000,1,2,3)$, $x'=(500, .5, 3, 2)$ and see if the first dimension necessarily dominates.
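As an illustration of that probe (plain numpy, made-up helper names and scales), the plain RBF kernel underflows to zero on those points because the first coordinate dominates, while a variance-scaled kernel along the lines of the previous paragraph recovers signal from the remaining dimensions:

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    # Plain RBF kernel.
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def rbf_ard(x, xp, inv_sigma, gamma=1.0):
    # Mahalanobis/ARD-style RBF kernel with a per-dimension scaling matrix.
    d = x - xp
    return np.exp(-gamma * d @ inv_sigma @ d)

x  = np.array([1000.0, 1.0, 2.0, 3.0])
xp = np.array([500.0, 0.5, 3.0, 2.0])

# ||x - x'||^2 = 500^2 + 0.25 + 1 + 1: the first coordinate swamps the rest,
# and the kernel value underflows to 0.
print(rbf(x, xp))

# With illustrative per-dimension variances, each squared difference is put on
# a comparable scale and the kernel value becomes informative again.
inv_sigma = np.diag(1.0 / np.array([250000.0, 0.25, 1.0, 1.0]))
print(rbf_ard(x, xp, inv_sigma))  # exp(-4) ~ 0.018
```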
Another way to try to assess a given kernel is to try to see if it inherits scale issues from subfunctions. For example, consider the polynomial kernel $K_{poly}(x,x') = (a+cx^Tx')^d$. We can write this as a function of the linear kernel $x^Tx'$, which we already know to be sensitive to scale, and the map $z \mapsto (a+cz)^d$ won't undo scale issues, so we can see that the polynomial kernel inherits these issues. We can do a similar analysis by writing the RBF kernel as a function of the scale-sensitive $||x-x'||^2$.
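Reusing $x$ and $x'$ from the sketch above, the same probe applied to the linear and polynomial kernels (with illustrative values for $a$, $c$, and $d$) makes the inheritance visible:

```python
# The cross term from the first dimension (1000 * 500) swamps the other
# dimensions' contribution (0.5 + 6 + 6), and the map z -> (a + c z)^d
# only amplifies the imbalance.
lin = x @ xp                     # 500012.5, dominated by the first coordinate
poly = (1.0 + 0.01 * lin) ** 2   # illustrative a = 1, c = 0.01, d = 2
print(lin, poly)
```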
Best Answer
Scikit-Learn's SVM is based on LIBSVM (see the documentation of SVC), whose implementation document you can find here; the specific solver is described in Section 4.
Generally, gradient descent is a first-order method: it uses minimal information about the problem (only gradients), so it tends to converge slowly and can run into convergence issues. Taking more of the problem's structure into account, such as its quadratic objective and linear constraints, yields faster and more robust algorithms. There are also quadratic optimization algorithms developed specifically for the SVM, notably sequential minimal optimization (SMO).
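To see what the first-order approach looks like, here is a minimal sketch (not any library's implementation) of subgradient descent on the primal soft-margin objective for a linear SVM; it only ever touches gradients, ignoring the quadratic structure that SMO exploits:

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=1e-3, epochs=100):
    """Sketch: subgradient descent on (1/2)||w||^2 + C * sum of hinge losses."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # samples violating the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```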
Also note that Scikit-Learn has SGDClassifier, which is exactly what you propose: the SVM (hinge loss) formulation optimized with stochastic gradient descent. You may find this question comparing the two interesting.
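As a rough usage sketch with real scikit-learn APIs (toy data, near-default hyperparameters), the two fit the same kind of linear hinge-loss model with very different optimizers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Dual QP (SMO, via LIBSVM) on the hinge loss with a linear kernel.
svc = SVC(kernel="linear", C=1.0).fit(X, y)

# Primal stochastic (sub)gradient descent on the same hinge loss.
sgd = SGDClassifier(loss="hinge", max_iter=1000, random_state=0).fit(X, y)

print(svc.score(X, y), sgd.score(X, y))
```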