Solved – Non-linear SVM classification with RBF kernel

classification, kernel trick, nonlinear, svm

I'm implementing a non-linear SVM classifier with RBF kernel. I was told that the only difference from a normal SVM was that I had to simply replace the dot product with a kernel function:
$$
K(x_i,x_j)=\exp\left(-\frac{||x_i-x_j||^2}{2\sigma^2}\right)
$$
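For concreteness, here is a minimal NumPy sketch of that kernel as I understand it (the `sigma` argument is the bandwidth $\sigma$ from the formula):

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```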
I know how a normal linear SVM works: after solving the quadratic optimization problem (the dual task), I compute the normal vector of the optimal separating hyperplane as
$$
w^*=\sum_{i \in SV} h_i y_i x_i
$$
and the offset of the hyperplane
$$
b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j x_j^T x_i\right)\right)
$$
respectively, where $x$ is the list of my training vectors, $y$ are their respective labels ($y_i \in \{-1,1\}$), $h$ are the Lagrangian coefficients and $SV$ is the set of support vectors. After that, I can use $w^*$ and $b^*$ alone to classify easily: $c_x=\text{sign}(w^{*T}x+b^*)$.
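For example, in code (just a sketch; I assume the multipliers `h` already came out of a QP solver, and `tol` is an arbitrary threshold I use to pick out the support vectors):

```python
import numpy as np

def linear_svm_params(X, y, h, tol=1e-8):
    """Recover w* and b* from already-solved dual multipliers h.

    X: (N, d) training vectors, y: (N,) labels in {-1, +1}, h: (N,) multipliers.
    The support vectors are the points with h_i > tol.
    """
    sv = h > tol
    w = (h[sv] * y[sv]) @ X[sv]       # w* = sum_{i in SV} h_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)    # b* averaged over the support vectors
    return w, b

def classify(x, w, b):
    return np.sign(w @ x + b)         # c_x = sign(w*^T x + b*)
```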

However, I don't think I can do such a thing with an RBF kernel. I found some materials suggesting that $K(x,y)=\phi(x)^T\phi(y)$, which would make it easy. Nevertheless, I don't think such a decomposition exists for this kernel, and it isn't mentioned anywhere. Is it the case that all the support vectors are needed for classification? If so, how do I classify in that case?

Best Answer

Let $\mathcal{X}$ denote your input space, i.e., the space where your data points reside. Consider a function $\Phi:\mathcal{X} \rightarrow \mathcal{F}$ that takes a point in the input space $\mathcal{X}$ and maps it to a point in $\mathcal{F}$. Now suppose we have mapped all the data points from $\mathcal{X}$ into this new space $\mathcal{F}$. If you try to solve the normal linear SVM in this new space $\mathcal{F}$ instead of $\mathcal{X}$, you will notice that all the earlier workings look exactly the same, except that every point $x_i$ is represented as $\Phi(x_i)$, and instead of $x^Ty$ (the dot product, which is the natural inner product for Euclidean space) we use $\langle \Phi(x), \Phi(y) \rangle$, the natural inner product of the new space $\mathcal{F}$. So, in the end, your $w^*$ would look like

$$ w^*=\sum_{i \in SV} h_i y_i \Phi(x_i) $$

and hence, $$ \langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i \langle \Phi(x_i), \Phi(x) \rangle $$

Similarly, $$ b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j \langle \Phi(x_j), \Phi(x_i)\rangle\right)\right) $$

and your classification rule looks like: $c_x=\text{sign}(\langle w, \Phi(x) \rangle+b)$.

So far so good; there is nothing new here, since we have simply applied the ordinary linear SVM in a different space. However, here is the magic part:

Let us say that there exists a function $k:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ such that $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. Then, we can replace all the dot products above with $k(x_i, x_j)$. Such a $k$ is called a kernel function.

Therefore, $\langle w^*, \Phi(x)\rangle$ and $b^*$ become $$ \langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i k(x_i, x) $$ $$ b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j k(x_j, x_i)\right)\right) $$
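As a concrete illustration, here is a rough NumPy sketch of that kernelized decision function (the names `X_sv`, `y_sv`, `h_sv` are mine, standing for the support vectors, their labels and their multipliers from the dual solution; `sigma` is the RBF bandwidth):

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Matrix of kernel values k(a_i, b_j) between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def kernel_svm_bias(X_sv, y_sv, h_sv, sigma=1.0):
    """b* = mean over SVs of ( y_i - sum_j h_j y_j k(x_j, x_i) ).
    The sum over j may be restricted to the SVs since h_j = 0 elsewhere."""
    K = rbf_gram(X_sv, X_sv, sigma)
    return np.mean(y_sv - K @ (h_sv * y_sv))

def kernel_svm_predict(x, X_sv, y_sv, h_sv, b, sigma=1.0):
    """c_x = sign( sum_{i in SV} h_i y_i k(x_i, x) + b* )."""
    k = rbf_gram(X_sv, np.atleast_2d(x), sigma).ravel()
    return np.sign(np.sum(h_sv * y_sv * k) + b)
```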

For which kernel functions is the above substitution valid? That is a slightly more involved question: essentially, $k$ must be a positive semi-definite (Mercer) kernel, and you may want to consult a proper reference for the details. However, I will just add that the above does hold for the RBF kernel.

To answer your question, "Is it the case that all the support vectors are needed for classification?": yes. As you can see above, we compute the inner product of $w^*$ with $\Phi(x)$ through the kernel instead of computing $w^*$ explicitly, so we must retain all the support vectors (along with their multipliers $h_i$ and labels $y_i$) at classification time.

Note: the $h_i$'s in this final section are the solution to the dual of the SVM in the space $\mathcal{F}$, not in $\mathcal{X}$. Does that mean we need to know the function $\Phi$ explicitly? Luckily, no. If you look at the dual objective, it consists only of inner products, and since $k$ lets us compute those inner products directly, we never need $\Phi$ explicitly. The dual simply looks like $$ \max_h \sum_i h_i - \frac{1}{2}\sum_{i,j} y_i y_j h_i h_j k(x_i, x_j) \\ \text{subject to: } \sum_i y_i h_i = 0, \quad h_i \geq 0 $$
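If you want to see the whole thing end to end, here is a toy sketch that maximises this dual with SciPy's general-purpose SLSQP solver (real SVM libraries use specialised QP or SMO solvers instead, so treat this purely as an illustration of the hard-margin dual above):

```python
import numpy as np
from scipy.optimize import minimize

def solve_rbf_svm_dual(X, y, sigma=1.0):
    """Maximise  sum_i h_i - 0.5 * sum_{i,j} y_i y_j h_i h_j k(x_i, x_j)
    subject to  sum_i y_i h_i = 0  and  h_i >= 0  (hard margin)."""
    N = X.shape[0]
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    K = np.exp(-sq / (2.0 * sigma**2))              # Gram matrix k(x_i, x_j)
    Q = (y[:, None] * y[None, :]) * K               # y_i y_j k(x_i, x_j)

    objective = lambda h: 0.5 * h @ Q @ h - np.sum(h)          # negated dual
    constraints = [{"type": "eq", "fun": lambda h: h @ y}]     # sum_i y_i h_i = 0
    bounds = [(0.0, None)] * N                                  # h_i >= 0

    res = minimize(objective, np.zeros(N), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x                                    # the multipliers h_i
```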