Solved – Applying an RBF kernel first and then training using a linear classifier

classification, kernel trick, svm

I will start off by saying that I don't have a concrete understanding of what's under the hood of an SVM classifier.

I am interested in using an SVM with the RBF kernel to train a two-class classifier. However, I find that training (and even prediction) takes a long time when working with the RBF kernel (I have been using libsvm in MATLAB and sklearn in Python).

My question is: is it possible to project my data into the higher-dimensional space using an RBF kernel on its own, and then apply a linear SVM to the transformed data? That is, would this yield the same results as using an RBF SVM, as long as I use the same C and gamma?
I am not too sure how kernels are applied, so I hope this part makes sense.

If that is true, then I could pre-process the data into the higher-dimensional feature space using the RBF kernel and then make training and prediction much faster by using a simple linear classifier.

Best Answer

As the previous answers say, RBF kernels embed data points into an infinite-dimensional space. But it turns out you can approximate that in a finite-dimensional space, as proposed by the paper Random Features for Large-Scale Kernel Machines by Rahimi and Recht, NIPS 2007. The method is also sometimes called "random kitchen sinks."

The gist of the method is: if you want to use an RBF kernel $k(x, y) = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)$, then you can get a feature map $z : \mathbb R^d \to \mathbb R^D$ such that $k(x, y) \approx z(x)^T z(y)$ by:

  • Sample $D/2$ $d$-dimensional weight vectors $\omega_i \sim \mathcal{N}(0, \frac{1}{\sigma^2} I)$.
  • Define $z(x) = \sqrt{\tfrac{2}{D}} \begin{bmatrix} \sin(\omega_1^T x) & \cos(\omega_1^T x) & \cdots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x) \end{bmatrix}^T$ (the $\sqrt{2/D}$ scaling turns the sum below into an average, so that $z(x)^T z(y)$ really approximates $k(x, y)$).
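To see why this works (a standard check, not the paper's notation): by the product-to-sum identity,

$$ z(x)^T z(y) = \frac{2}{D} \sum_{i=1}^{D/2} \left[ \sin(\omega_i^T x)\sin(\omega_i^T y) + \cos(\omega_i^T x)\cos(\omega_i^T y) \right] = \frac{2}{D} \sum_{i=1}^{D/2} \cos\!\left( \omega_i^T (x - y) \right), $$

and since $\mathbb{E}_{\omega \sim \mathcal{N}(0, \sigma^{-2} I)}\left[ \cos\!\left( \omega^T (x - y) \right) \right] = \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right) = k(x, y)$, the average on the right converges to $k(x, y)$ as $D$ grows.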

Then you can train a linear SVM on these features, which will approximate the RBF-kernel SVM on the original features.
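For concreteness, here is a minimal sketch of that recipe in Python with numpy and scikit-learn. The data arrays `X_train`, `y_train`, `X_test` are placeholders, and the values of `D` and `sigma` are arbitrary choices you would tune yourself:

```python
# Minimal sketch of the random-features recipe above (not reference code from the paper).
import numpy as np
from sklearn.svm import LinearSVC

def rff_map(X, omegas):
    """Map X (n, d) to the D-dimensional random Fourier feature space."""
    proj = X @ omegas.T                      # (n, D/2) projections omega_i^T x
    D = 2 * omegas.shape[0]
    # Ordering of the sin/cos features doesn't affect the inner product,
    # so stacking all sines then all cosines is equivalent to interleaving them.
    return np.sqrt(2.0 / D) * np.hstack([np.sin(proj), np.cos(proj)])

rng = np.random.default_rng(0)
d = X_train.shape[1]                         # X_train, y_train, X_test: your data (placeholders)
D = 500                                      # number of random features (D/2 = 250 draws)
sigma = 1.0                                  # RBF bandwidth; corresponds to gamma = 1 / (2 * sigma**2)
omegas = rng.normal(scale=1.0 / sigma, size=(D // 2, d))   # omega_i ~ N(0, sigma^{-2} I)

clf = LinearSVC(C=1.0)                       # linear SVM on the approximate feature map
clf.fit(rff_map(X_train, omegas), y_train)
preds = clf.predict(rff_map(X_test, omegas))
```

Training and prediction cost then scale with $D$ rather than with the number of support vectors, which is where the speedup comes from.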

(Note that the linked version of the paper doesn't discuss this particular variant, but rather one that seems like it might be better, though my (very) recent paper argues it is actually worse.)

This is implemented in scikit-learn, shogun, and JSAT.
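In scikit-learn, for example, you can chain the built-in approximation with a linear SVM; a minimal sketch, again assuming placeholder `X_train`/`y_train` (note that `RBFSampler` implements a random Fourier feature variant, and `Nystroem` is another option in the same module):

```python
# Approximate RBF-kernel SVM: random Fourier features + linear SVM.
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# gamma plays the same role as libsvm's gamma = 1 / (2 * sigma**2)
approx_rbf_svm = make_pipeline(
    RBFSampler(gamma=0.1, n_components=500, random_state=0),
    LinearSVC(C=1.0),
)
approx_rbf_svm.fit(X_train, y_train)         # X_train, y_train: your data (placeholders)
```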

There's also a method called Fastfood (Le, Sarlós, and Smola, Fastfood – Approximating Kernel Expansions in Loglinear Time, ICML 2013) that speeds up the method for large $d$ and decreases storage requirements. Good implementations are more complicated, though. Here's one for scikit-learn that's okay, though I might work on parallelizing it further soon; there's also a MATLAB one and a Shark/C++ one that I haven't tested.