Solved – Using a Gaussian kernel in SVM: how exactly is this then written as a dot product?

kernel-trick, machine-learning, svm

I am attempting to use SVMs for my class project, and I have selected the Gaussian kernel as, well, the kernel. That is,

$$
k(\mathbf{x}_1, \mathbf{x}_n) = e^{-\gamma \|\mathbf{x}_1 - \mathbf{x}_n\|^2}
$$

What I do not understand is how this kernel is then 'written as a dot product', or how we get around having to do that. My professor says that once training is finished, classifying a new vector amounts to computing a dot product between it and the support vectors (SVs). But given this kernel, how is that dot product actually being computed?

Best Answer

Look up "kernel trick". The idea is that, under certain conditions (Mercer's condition), a function $k(x,x')$ can be expressed as a dot product $<\phi(x),~\phi(x')>$, where $\phi$ is a function that transforms $x$ into a high dimensional (possibly infinite) representation.

The trick is that, as long as your optimization problem can be expressed solely in terms of dot products, you never need to know or compute $\phi$; you simply use the kernel function $k$ wherever a dot product would appear. In particular, the SVM decision function for a new point $\mathbf{x}$ is $f(\mathbf{x}) = \sum_i \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b$, where the sum runs over the support vectors $\mathbf{x}_i$. So 'taking a dot product between a new vector and the SVs' just means evaluating the kernel against each support vector.
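As a concrete illustration, here is a minimal NumPy sketch of evaluating that decision function with the Gaussian kernel. The support vectors, dual coefficients $\alpha_i y_i$, and bias below are made-up placeholders, not values from an actual training run:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian (RBF) kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_decision(x_new, support_vectors, dual_coefs, bias, gamma=0.5):
    """Decision function f(x) = sum_i alpha_i * y_i * k(sv_i, x) + b.

    Each kernel evaluation stands in for the dot product
    <phi(sv_i), phi(x_new)> in feature space; phi itself is
    never constructed.
    """
    return sum(coef * rbf_kernel(sv, x_new, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias

# Hypothetical "trained" model: support vectors, alpha_i * y_i, and bias.
support_vectors = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, -1.0]])
dual_coefs = np.array([0.7, -0.4, 0.9])  # placeholder alpha_i * y_i values
bias = -0.1

x_new = np.array([1.5, 0.0])
print(svm_decision(x_new, support_vectors, dual_coefs, bias))
```

The sign of the returned value gives the predicted class; notice that nothing infinite-dimensional is ever computed.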

More details on Wikipedia: https://en.wikipedia.org/wiki/Kernel_method