Solved – Understanding Kernel Functions for SVMs

classification, kernel trick, machine learning, svm

I am learning about Support Vector Machines, and in particular, those with kernels for non-linear decision boundaries. I understand the concept of projecting the original data to a higher-dimensional space, such that a linear decision boundary can be inserted. But what I don't understand is how the kernel function actually does this mapping.

For example, consider the radial basis function (RBF) kernel $K(x, x') = \exp(-\lambda \| x - x' \|^2)$. What does this actually mean?

Is it saying that for every data point $x$, you find its squared distance to the point $x'$? And this distance corresponds to its value in the new higher-dimensional space? But what is $x'$ anyway? And also, this is only mapping to a one-dimensional space, because the value is just a distance…not a higher-dimensional space.

Of course, my understanding is totally wrong, but please could somebody explain where I am getting confused? Thanks!

Best Answer

By solving the SVM optimization problem in its dual form, it turns out that the problem depends on the training data $\{x_i\}_{i=1}^n$ only through their inner products. That is, you only need $\{x_i^\top x_j\}_{i, j=1}^n$, i.e., the inner products of all pairs of points. So to train an SVM, you only need to give it the labels $Y=(y_1, \ldots, y_n)$ and a kernel matrix $K$ where $K_{ij} = x_i^\top x_j$.
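To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn's `SVC` with `kernel="precomputed"`; the data are made up) showing that the classifier only ever sees labels and a Gram matrix of inner products:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: 2-D points with binary labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Gram matrix of plain inner products: K[i, j] = x_i . x_j
K_train = X_train @ X_train.T

# The SVM is trained from the labels and the kernel matrix alone
clf = SVC(kernel="precomputed")
clf.fit(K_train, y_train)

# At prediction time we only need inner products between test and training points
X_test = rng.normal(size=(5, 2))
K_test = X_test @ X_train.T
print(clf.predict(K_test))
```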

Now, to map each data point $x_i$ to a high-dimensional space, you apply a feature map $\phi$. The kernel matrix then becomes

$$K_{ij} = \langle \phi(x_i), \phi(x_j)\rangle$$

where $\langle \cdot,\cdot\rangle$ is just formal notation for an inner product in a general inner product space. It can be seen that as long as we can define an inner product in the high-dimensional space, we can train an SVM. We do not even need to compute $\phi(x)$ itself; we only need the inner product $\langle \phi(x_i), \phi(x_j)\rangle$. This is where we set

$$K_{ij} = k(x_i, x_j)$$

for some kernel $k$ of your choice. It is known (by the Moore–Aronszajn theorem) that if $k$ is positive definite, then it corresponds to some inner product space, i.e., there exists a corresponding feature map $\phi(\cdot)$ such that $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
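For a concrete finite-dimensional example, here is a small numerical check (a sketch assuming NumPy; the points are arbitrary) that the homogeneous degree-2 polynomial kernel $k(x, y) = (x^\top y)^2$ on $\mathbb{R}^2$ corresponds to the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$:

```python
import numpy as np

def poly2_kernel(x, y):
    # Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2
    return (x @ y) ** 2

def phi(x):
    # Explicit feature map into R^3 for the same kernel
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(poly2_kernel(x, y))   # 1.0, computed without ever building phi
print(phi(x) @ phi(y))      # 1.0, the same value via the explicit map
```

Both lines print the same number, but the kernel evaluation never constructs the feature vectors, which is the whole point once the feature space becomes very large or infinite-dimensional.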

To answer your question: the kernel $k(x,y)$ itself does not specify a projection of $x$. It is the feature map $\phi(\cdot)$ associated with $k$ (which is usually left implicit) that specifies the projection. As an example, the feature map $\phi$ of the RBF kernel $k(x,y) = \exp(-\gamma \|x-y\|_2^2)$ is infinite-dimensional.
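To get a feel for that infinite-dimensional map, here is a sketch (assuming NumPy, and restricting to one-dimensional inputs to keep the algebra short) that uses the Taylor expansion of $\exp(2\gamma x y)$ to build a truncated feature map whose inner product converges to the exact RBF kernel value:

```python
import numpy as np
from math import factorial

def rbf_kernel_1d(x, y, gamma=1.0):
    # Exact RBF kernel for scalar inputs
    return np.exp(-gamma * (x - y) ** 2)

def phi_truncated(x, gamma=1.0, n_terms=20):
    # First n_terms coordinates of the infinite-dimensional feature map:
    # phi_n(x) = exp(-gamma * x^2) * sqrt((2*gamma)^n / n!) * x^n
    n = np.arange(n_terms)
    coeffs = np.sqrt((2 * gamma) ** n / np.array([factorial(int(k)) for k in n]))
    return np.exp(-gamma * x ** 2) * coeffs * x ** n

x, y = 0.7, -0.3
print(rbf_kernel_1d(x, y))                  # exact kernel value
print(phi_truncated(x) @ phi_truncated(y))  # truncated <phi(x), phi(y)>, nearly identical
```

Truncating after 20 terms already matches the exact value to many decimal places; the true $\phi$ simply keeps all infinitely many coordinates.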