Solved – SVM: How to get the predicted output from an SVM with a Gaussian / RBF kernel (Andrew Ng's course, svmPredict)?

kernel trick, machine learning, MATLAB, svm

I've been learning machine learning through Andrew Ng's Coursera course. I've completed Andrew's homework on SVMs, but it felt wishy-washy and I'm having a hard time taking it from zero to finish.

Say you've calculated the SVM "model" using the svmTrain function provided by Andrew Ng, and now you want to see the predicted output (using Andrew's svmPredict) on a set of examples so you can calculate the error. Easy enough:

model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));  % train with the Gaussian / RBF kernel
predictions = svmPredict(model, Xval);                               % predicted labels on the validation set
err = mean(double(predictions ~= yval));                             % fraction misclassified

All of that makes sense. What trips me up is what svmPredict is doing, specifically:

elseif strfind(func2str(model.kernelFunction), 'gaussianKernel')
    % Vectorized RBF Kernel
    % This is equivalent to computing the kernel on every pair of examples
    X1 = sum(X.^2, 2);                                             % squared norm of each input to predict
    X2 = sum(model.X.^2, 2)';                                      % squared norm of each support vector
    K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X'));  % all pairwise squared distances
    K = model.kernelFunction(1, 0) .^ K;                           % turn squared distances into kernel values
    K = bsxfun(@times, model.y', K);                               % weight by the labels y_i
    K = bsxfun(@times, model.alphas', K);                          % weight by the dual coefficients alpha_i
    p = sum(K, 2);                                                 % sum over support vectors

What equation/formula is it trying to solve here with the Gaussian (RBF) kernel? What does that equation even look like?

Also, if I give you a new input $\vec u$, how would you find its predicted output?

The predicted output for the linear kernel looks pretty readable:

p = X * model.w + model.b;

That is just computing $y = X\vec w + b$, with each example being a row of $X$.
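So for a single new input $\vec u$, I assume the linear case would just be something like this (made-up numbers, and assuming svmPredict's convention of thresholding the score at zero to get a 0/1 label):

% Hypothetical single new input u (column vector) -- my guess at the linear case.
u = [1.5; -0.3];                  % made-up feature values
score = u' * model.w + model.b;   % same formula, applied to one example
label = double(score >= 0);       % threshold the sign to get a 0/1 label

What I can't see is how to write the analogous thing for the Gaussian / RBF kernel.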

Any help would be sincerely appreciated.

Best Answer

In kernel SVM you map your data points into a possibly infinite-dimensional Hilbert space $\mathcal H$. It turns out that $w$, the normal vector to the separating hyperplane (which, together with $b$, completely characterizes the decision boundary), has the form $$ w = \sum_{i=1}^n \alpha_i y_i \phi(x_i) $$ where $\phi$ is the map from the input space to the feature space and the $\alpha_i$ come from the dual formulation (with $\alpha_i = 0$ whenever $x_i$ is not a support vector). One of the things that makes SVM practical is that $w$ is a finite linear combination of the $\{\phi(x_i)\}$, which is due to the representer theorem.
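As a sanity check, in the special case of a linear kernel ($\phi$ the identity) this says $w$ is just a weighted sum of the training points themselves, which I'd expect to be exactly what the course stores as model.w. A rough Matlab sketch (double-check the syntax, since I don't really know the language; I'm assuming model.X holds the support vectors, model.y their $\pm 1$ labels, and model.alphas the corresponding $\alpha_i$):

% Sketch: w = sum_i alpha_i * y_i * x_i for the linear kernel.
% Assumes model.X is m x n, and model.alphas, model.y are m x 1 with y_i in {-1, +1}.
w_manual = model.X' * (model.alphas .* model.y);   % n x 1; I'd expect this to match model.w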

We classify a point $x$ via $$ \text{sgn}\big(b + \langle w, \phi(x)\rangle_{\mathcal H}\big) = \text{sgn}\left(b + \sum_{i=1}^n \alpha_i y_i K(x_i, x)\right) $$ where $K$ is the kernel function, $K(x_i, x) = \langle \phi(x_i), \phi(x)\rangle_{\mathcal H}$. This is just determining which side of the hyperplane $x$ is on, and note that we never actually need any $\phi(x)$; we only need inner products, which are exactly what $K$ provides.
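In rough Matlab (again, treat this as a sketch and double-check it), that sum for one new point $\vec u$ would look something like the following; I'm assuming the same model fields as before, plus model.b for the bias and model.kernelFunction as stored by svmTrain:

% Sketch: score one new point u (column vector) with a kernel SVM.
f = model.b;                                   % start from the bias term
for i = 1:size(model.X, 1)                     % loop over the stored support vectors
    f = f + model.alphas(i) * model.y(i) ...
          * model.kernelFunction(model.X(i, :)', u(:));
end
predicted_label = double(f >= 0);              % which side of the hyperplane u is on (0/1, as in svmPredict)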

There is no way around computing this sum, but for certain kernels there may be tricks to compute it faster (note how the code for the "other non-linear kernels" branch computes exactly what I've described here). In particular, for $$ K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right) $$ we have $$ \|x - x'\|^2 = \|x\|^2 + \|x'\|^2 - 2 x^T x', $$ so you can work with vectors and matrix products rather than a double for loop, which may be much faster.

I don't know Matlab, but in their code it looks like X1 computes the squared norm of each test point and X2 computes the squared norm of each training point. They then add these and subtract twice the inner products of the data points, which is exactly the (or at least a) vectorized way to compute all pairwise $\|x - x'\|^2$. I'm guessing the part with model.kernelFunction(1, 0) is how you go from $\|x - x'\|^2$ to $\exp\left(-\gamma \|x-x'\|^2\right)$, and the rest is just element-wise multiplication by $\vec \alpha$ and $\vec y$ before summing to get a single signed score per test point.
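To spell the vectorized trick out (rough Matlab once more), here is a toy comparison of the double loop against the distance identity. I'm assuming the course's gaussianKernel(x1, x2, sigma) is $\exp(-\|x_1 - x_2\|^2 / (2\sigma^2))$, i.e. $\gamma = 1/(2\sigma^2)$ in my notation:

% Toy check: double for loop vs. the vectorized pairwise-distance trick.
Xnew = randn(5, 2);   Xsv = randn(3, 2);   sigma = 0.7;   % made-up data

% Double loop: evaluate the kernel on every pair explicitly.
K_loop = zeros(5, 3);
for i = 1:5
    for j = 1:3
        K_loop(i, j) = gaussianKernel(Xnew(i, :)', Xsv(j, :)', sigma);
    end
end

% Vectorized: ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2 x^T x', then exponentiate.
D2 = bsxfun(@plus, sum(Xnew.^2, 2), sum(Xsv.^2, 2)') - 2 * Xnew * Xsv';
K_vec = exp(-D2 / (2 * sigma^2));
% Equivalently: gaussianKernel(1, 0, sigma) .^ D2, since that base is exp(-1/(2*sigma^2)).

max(abs(K_loop(:) - K_vec(:)))   % should be essentially zero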
