SVM – Interpreting Distance from Hyperplane in Support Vector Machines

Tags: machine-learning, max-margin, svm

I have a few doubts in understanding SVMs intuitively. Assume we have trained an SVM model for classification using some standard tool like SVMLight or LibSVM.

  1. When we use this model for prediction on test data, the model generates a file containing an "alpha" value for each test point. If the alpha value is positive, the test point belongs to Class 1; otherwise it belongs to Class 2. Now, can we say that a test point with a greater "alpha" value belongs to the corresponding class with "higher" probability?

  2. Similar to the first question: once the SVM is trained, the support vectors (SVs) lie very near the hyperplane. Does that mean the SVs belong to their class with high probability? Can we relate the probability of a point belonging to a class with its distance from the hyperplane? Does the "alpha" value represent distance from the hyperplane?

Thanks for your input.

Best Answer

Let me first answer your question in general. The SVM is not a probabilistic model. One reason is that its loss does not correspond to a normalizable likelihood. For example, in regularized least squares you have the loss function $\sum_i (y_i - \langle w, x_i\rangle - b)^2$ and the regularizer $\|w\|_2^2$. The weight vector is obtained by minimizing the sum of the two. However, this is equivalent to maximizing the log-posterior of $w$ given the data, $p(w|(y_1,x_1),\ldots,(y_m,x_m)) = \frac{1}{Z} \exp(-\|w\|_2^2)\prod_i \exp(-(y_i - \langle w, x_i\rangle - b)^2)$, which you can see is the product of a Gaussian likelihood and a Gaussian prior on $w$ ($Z$ makes sure that it normalizes). You get the Gaussian likelihood from the loss function by flipping its sign and exponentiating it. However, if you do that with the loss function of the SVM (the hinge loss), the result does not correspond to a normalizable probabilistic model.
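To make the regularized-least-squares/MAP equivalence concrete, here is a minimal numerical sketch (my own illustration, not from the original answer; the bias $b$ is dropped for brevity): the closed-form ridge solution and a plain gradient-descent minimizer of the regularized loss (which is, up to sign and exponentiation, the negative log-posterior) converge to the same weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0  # regularization strength (ratio of noise variance to prior variance)

# Closed-form minimizer of sum_i (y_i - <w, x_i>)^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Minimizing the same objective by gradient descent; since the objective
# equals the negative log-posterior (up to constants), this is MAP estimation.
w = np.zeros(3)
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ w) + 2 * lam * w
    w -= 1e-3 * grad

print(np.allclose(w, w_ridge, atol=1e-6))
```

The hinge loss of the SVM admits no such reading: $\exp(-\max(0, 1 - y\,f(x)))$ is constant on the whole half-line where the margin exceeds 1, so it cannot be normalized into a likelihood over $y$.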

There are attempts to turn the SVM into a probabilistic model. The most notable one, which is, I think, also implemented in LibSVM, is:

John Platt: Probabilistic outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods (NIPS 1999): http://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
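Platt's method fits a sigmoid to the SVM's decision values to produce calibrated probabilities. As a sketch of how this looks in practice (using scikit-learn, which wraps LIBSVM; `probability=True` triggers its Platt-style calibration):

```python
# Platt scaling in scikit-learn: probability=True fits a sigmoid on the
# decision values via internal cross-validation, giving predict_proba.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

scores = clf.decision_function(X[:5])  # raw signed margins, unbounded
probs = clf.predict_proba(X[:5])       # calibrated class probabilities in [0, 1]
print(scores)
print(probs)
```

Note that the probabilities are a post-hoc calibration layer on top of the decision values, not something the SVM objective itself provides.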

To answer your question more specifically: the idea in SVMs is indeed that the further a test vector is from the hyperplane, the more confidently it belongs to a certain class (except when it is on the wrong side, of course). In that sense, support vectors do not belong to their class with high confidence, because they are either the points closest to the hyperplane or on the wrong side of it. The $\alpha$ value that you get from LibSVM has nothing to do with the $\alpha_i$ coefficients in the decision function. It is rather the output of the decision function $\sum_{i \in SV}\alpha_i k(x,x_i) + b$ (and should therefore properly be called $y$). Since $y = \sum_{i \in SV}\alpha_i k(x,x_i) + b = \langle w, \phi(x) \rangle_{\mathcal H} + b$, where $w$ lives in the reproducing kernel Hilbert space, $y$ is proportional to the signed distance to the hyperplane. It would be the actual signed distance if you divided by the norm of $w$, which in kernel terms is $\|w\|_{\mathcal H} = \sqrt{\sum_{i,j\in SV} \alpha_i \alpha_j k(x_i,x_j)}$.
