Solved – Support vector regression versus kernel ridge regression

kernel trick, machine learning, ridge regression, svm

I have a question concerning the difference between support vector regression and kernel ridge regression. I will try to write down all the math so that (hopefully) no misunderstandings arise.

Let's begin with the minimization problem associated with SVR. Assume we observe responses $y_1,\dots,y_n$ together with predictors $x_1,\dots,x_n$ where $x_i=(x_{i1},\dots,x_{ip})$. Denote by $w_i$ the weight of the kernel function $K_i(\cdot)$ associated with predictor $x_i$, and by $C$ the weight of the loss function. Then the SVR problem expressed as a minimization problem is
$$\begin{align} \min_{w,\,b,\,\xi} \ & \sum_i^n w_i^2 + C\sum_i^n (\xi_i^- + \xi_i^+)
\\ \text{s.t.} \ \ & \xi_i^- \geq y_i- b -\sum_j^n w_jK_j(x_{i})-\epsilon, \ &i=1,\dots,n \\
& \xi_i^+ \geq b + \sum_j^n w_jK_j(x_{i})- y_i -\epsilon, \ &i=1,\dots,n \\ &w_i,\ \xi_i^+,\ \xi_i^- \geq 0, \ & i=1,\dots,n\end{align} $$

Now $K_i(x)$ can be some kernel function, centered at $x_i$, e.g., the Gaussian kernel

$$ K_i(x) = \text{e}^{-\gamma||x_i-x||^2} $$

with $\gamma$ as a hyperparameter to be defined.
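As a concrete illustration, here is a minimal NumPy sketch of how the matrix of kernel evaluations $K_j(x_i)$ used above could be computed; the toy data and the value of $\gamma$ are arbitrary placeholders.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma):
    """Return the n x n matrix with entries K[i, j] = exp(-gamma * ||x_i - x_j||^2),
    i.e. the kernel centered at x_j evaluated at x_i."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy example: n = 50 observations with p = 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = gaussian_kernel_matrix(X, gamma=0.5)   # shape (50, 50), K[i, i] = 1
```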

At optimality at most one of $\xi_i^+, \xi_i^-$ is nonzero and, taking $\epsilon \approx 0$ for simplicity, it holds that $$
\xi_i^+ + \xi_i^- = \begin{cases} b + \sum_j^n w_jK_j(x_{i})- y_i &\text{ if } b + \sum_j^n w_jK_j(x_{i})- y_i \geq 0, \\ y_i- b -\sum_j^n w_jK_j(x_{i}) & \text{otherwise,} \end{cases}$$
which can be rewritten as
$$
\xi_i^+ + \xi_i^- = \left| b + \sum_j^n w_jK_j(x_{i})- y_i \right|. $$

With $\epsilon\approx 0$, the minimization problem can therefore be written more compactly as

$$\min_{b,\,w_i,\,i=1,\dots,n}\ \sum_i^n w_i^2 + C\sum_i^n \left| y_i-b-\sum_j^n w_jK_j(x_{i}) \right|.$$

For anyone doing ridge regression, this already looks familiar.
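For what it's worth, this compact formulation can also be handed directly to a general-purpose convex solver. Below is a small sketch using cvxpy (the toy data, $C$, and $\gamma$ are only illustrative); the non-negativity constraint on the weights from the original program can be added as noted in the comment.

```python
import numpy as np
import cvxpy as cp

# Toy data and Gaussian kernel matrix (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
gamma, C = 0.5, 10.0
K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

# min_{w, b}  sum_i w_i^2 + C * sum_i | y_i - b - sum_j w_j K_j(x_i) |
w = cp.Variable(50)          # use cp.Variable(50, nonneg=True) to keep w_i >= 0
b = cp.Variable()
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(cp.abs(y - b - K @ w)))
cp.Problem(objective).solve()

w_hat, b_hat = w.value, b.value
```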

Let us now replace the absolute error with the squared error and multiply the objective by $\lambda = \frac{1}{C}$; we arrive at the following minimization problem

$$\min_{b,\,w_i,\,i=1,\dots,n}\ \lambda \sum_i^n w_i^2 + \sum_i^n \left( y_i-b-\sum_j^n w_jK_j(x_{i}) \right)^2,$$
which is the corresponding kernel ridge regression problem.

So the difference between support vector regression (with $\epsilon \approx 0$) and kernel ridge regression boils down to two different types of loss function.
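Incidentally, one reason the squared-error version is so cheap: with the penalty written as $\sum_i w_i^2$, as above, this is just ordinary ridge regression treating the $n$ kernel evaluations as features, so it has a closed-form solution. A minimal sketch (assuming the kernel matrix $K$ and response $y$ are already built, and leaving the intercept unpenalized):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve min_{b, w}  lam * sum_i w_i^2 + sum_i (y_i - b - (K w)_i)^2 in closed form."""
    n = K.shape[0]
    Z = np.hstack([np.ones((n, 1)), K])   # first column carries the intercept b
    P = np.eye(n + 1)
    P[0, 0] = 0.0                         # do not penalize the intercept
    beta = np.linalg.solve(Z.T @ Z + lam * P, Z.T @ y)
    return beta[0], beta[1:]              # b, w

# Example call, with lambda = 1 / C as in the text:
# b_hat, w_hat = kernel_ridge_fit(K, y, lam=1.0 / 10.0)
```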

Is that really all there is to it? Considering the tremendous difference in computational cost between solving a least-squares problem and solving a quadratic program, I wonder why I should use SVR at all. Does the absolute error perform so much better than the squared error on real data that it justifies the extra computational effort?

Best Answer

As you might expect, it depends on what you want. In terms of generalization, the performance differences are typically minor.

That said, minimizing the $\ell_1$-type ($\epsilon$-insensitive) loss has the extremely attractive feature of yielding sparse solutions: the support vectors are only a subset of the training set. When doing ridge regression, just like in least-squares SVM, all training instances become support vectors and you end up with a model the size of your training set. A large model obviously requires a lot of memory and is slower at prediction time.
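A quick way to see this effect in practice is to compare the number of support vectors of an $\epsilon$-SVR fit with the number of nonzero dual coefficients of a kernel ridge fit, e.g. with scikit-learn (all hyperparameter values below are arbitrary illustrations):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)

# SVR keeps only the points outside (or on) the epsilon-tube; KRR keeps everything.
print("SVR support vectors:", len(svr.support_))                           # typically well below 200
print("KRR nonzero dual coefficients:", np.count_nonzero(krr.dual_coef_))  # essentially all 200
```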
