Is Gradient Descent possible for kernelized SVMs? (If so, why do people use Quadratic Programming?)

gradient descent, kernel trick, optimization, svm

Why do people use Quadratic Programming techniques (such as SMO) when dealing with kernelized SVMs? What is wrong with Gradient Descent? Is it impossible to use with kernels, or is it just too slow (and if so, why)?

Here is a little more context: trying to understand SVMs a bit better, I used Gradient Descent to train a linear SVM classifier using the following cost function:

$J(\mathbf{w}, b) = C \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \mathbf{x}^{(i)} + b)\right) \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

I am using the following notations:

  • $\mathbf{w}$ is the vector of the model's feature weights and $b$ is its bias parameter.
  • $\mathbf{x}^{(i)}$ is the $i^\text{th}$ training instance's feature vector.
  • $y^{(i)}$ is the target class (-1 or 1) for the $i^\text{th}$ instance.
  • $m$ is the number of training instances.
  • $C$ is the regularization hyperparameter.

I derived a (sub)gradient vector (with respect to $\mathbf{w}$ and $b$) from this equation, and Gradient Descent worked just fine.
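For concreteness, here is a minimal NumPy sketch of such a batch subgradient update (the function name, learning rate, and iteration count are illustrative, not the exact code I used):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, n_iters=1000):
    """Batch subgradient descent on the primal objective
    J(w, b) = C * sum_i max(0, 1 - y_i * (w . x_i + b)) + 0.5 * w . w
    X: (m, n) feature matrix; y: (m,) labels in {-1, +1}."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        viol = margins < 1  # only margin violators contribute to the hinge subgradient
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```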

Now I would like to tackle non-linear problems. Can I just replace all dot products $\mathbf{u}^t \cdot \mathbf{v}$ with $K(\mathbf{u}, \mathbf{v})$ in the cost function, where $K$ is the kernel function (for example the Gaussian RBF, $K(\mathbf{u}, \mathbf{v}) = e^{-\gamma \|\mathbf{u} - \mathbf{v}\|^2}$), then use calculus to derive a (sub)gradient vector and go ahead with Gradient Descent?

If it is too slow, why is that? Is the cost function not convex? Or is it because the gradient changes too fast (it is not Lipschitz continuous) so the algorithm keeps jumping across valleys during the descent, so it converges very slowly? But even then, how can it be worse than Quadratic Programming's time complexity, which is $O({n_\text{samples}}^2 \times n_\text{features})$? If it's a matter of local minima, can't Stochastic GD with simulated annealing overcome them?

Best Answer

Set $\mathbf w = \phi(\mathbf x) \cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x) = \mathbf u^t \cdot \mathbf K$ and $\mathbf w^t \mathbf w = \mathbf u^t \mathbf K \mathbf u$, with $\mathbf K = \phi(\mathbf x)^t \phi(\mathbf x)$, where $\phi(\mathbf x)$ is the feature mapping applied to the original input matrix $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:

$$J(\mathbf{u}, b) = C \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right) + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$

$\mathbf{K}$ is an $m \times m$ matrix and $\mathbf{u}$ is an $m \times 1$ vector, where $\mathbf{K}^{(i)}$ denotes the $i^\text{th}$ row of $\mathbf{K}$. Neither is infinite-dimensional, even when $\phi$ maps into an infinite-dimensional feature space.
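To make this concrete, here is a minimal NumPy sketch of subgradient descent on this kernelized primal objective (the Gaussian RBF kernel and all hyperparameters are illustrative choices):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian RBF: K[i, j] = exp(-gamma * ||X_i - Y_j||^2)
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def train_kernel_svm_primal(X, y, C=1.0, gamma=0.5, lr=1e-3, n_iters=1000):
    """Subgradient descent on
    J(u, b) = C * sum_i max(0, 1 - y_i * (K_i . u + b)) + 0.5 * u^T K u."""
    m = X.shape[0]
    K = rbf_kernel(X, X, gamma)              # m x m Gram matrix
    u, b = np.zeros(m), 0.0
    for _ in range(n_iters):
        margins = y * (K @ u + b)
        viol = margins < 1                   # margin violators
        grad_u = K @ u - C * (y[viol, None] * K[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        u -= lr * grad_u
        b -= lr * grad_b
    return u, b

# Prediction on new points X_new: sign(rbf_kernel(X_new, X_train, gamma) @ u + b)
```

Note that a full batch step touches the entire $m \times m$ Gram matrix, so each iteration costs $\mathcal{O}(m^2)$ time and memory.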

Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as yielding meaningful approximate solutions (which the dual formulation does not guarantee).


Now, why the dual is so much more prominent is not obvious at all; quoting [1]:

The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]


Chapelle (2007) argues that the time complexity of both primal and dual optimization is $\mathcal{O}\left(n n_{sv} + n_{sv}^3\right)$, with a worst case of $\mathcal{O}\left(n^3\right)$. Note, however, that the analysis covers the quadratic and approximate (smoothed) hinge losses rather than the proper hinge loss, which is not differentiable and therefore cannot be used with Newton's method.
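For reference, the quadratic penalization analyzed there amounts to squaring the hinge term; in the notation above that would be

$$J(\mathbf{u}, b) = C \sum\limits_{i=1}^{m} \max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)^2 + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$

whose gradient is continuous (the kink of the hinge is squared away), which is what makes Newton-type primal optimization applicable; the plain hinge loss only admits subgradient methods like the one sketched above.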


[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.