Solved – SVM confidence according to distance from hyperplane

Tags: classification, probability, svm, uncertainty, unsupervised-learning

For a probabilistic multi-class classifier we can get the probability that a new point $x$ belongs to each class $y_i$; in the case of 3 classes, suppose we get $P(y_a|x) > P(y_b|x) > P(y_c|x)$, so the most probable class for $x$ is $y_a$. Now suppose we have a multi-class SVM from which we can get a membership score for $x$ in each class (according to its distance from the separating hyperplanes); in the case of 3 classes, suppose we get $Score(y_a|x)$, $Score(y_b|x)$, $Score(y_c|x)$. How do we determine the first, second, and third most likely class of $x$ in this case (without converting these scores to probabilities)? Usually I get positive and negative values, for instance $Score_1 = -8622$, $Score_2 = 5233$, $Score_3 = -665$.
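To make the setup concrete, here is a minimal sketch of the kind of scores I mean, assuming scikit-learn's `LinearSVC` as the multi-class SVM (the toy data is invented purely for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 3-class data, invented purely for illustration.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0],
              [1.1, 0.9], [2.0, 0.0], [2.1, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])

# LinearSVC trains one-vs-rest, so each class gets its own hyperplane.
clf = LinearSVC().fit(X, y)

# Per-class scores for a new point: signed distances (up to a scaling
# factor) from each class's hyperplane, so they can be negative.
scores = clf.decision_function([[1.0, 0.5]])[0]

# Ranking classes by raw score, largest first.
ranking = np.argsort(scores)[::-1]
print(scores, "->", [clf.classes_[i] for i in ranking])
```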

Best Answer

It's actually possible to get probabilities out of a Support Vector Machine, which might be more useful and interpretable than an arbitrary "score" value. There are a few approaches for doing this: one reasonable place to start is Platt (1999).

Most SVM packages/libraries implement something like this (for example, the -b 1 option causes LibSVM to produce probabilities). If you're going to roll your own, you should be aware that there are some potential numerical issues, summarized in this note by Lin, Lin, and Weng (2007). They also provide some pseudocode, which might be helpful too.
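As a concrete illustration, here is a minimal sketch assuming scikit-learn's `SVC`, whose `probability=True` option enables LibSVM's internal Platt-style calibration (the toy data is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 3-class data, invented for illustration.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in (-2, 0, 2)])
y = np.repeat([0, 1, 2], 20)

# probability=True enables LibSVM's internal Platt-style calibration
# (equivalent in spirit to passing -b 1 to LibSVM directly).
clf = SVC(kernel="linear", probability=True).fit(X, y)

# Per-class probabilities for a new point, which sum to 1.
print(clf.predict_proba([[0.1, 0.1]])[0].round(3))
```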

Edit in response to your comment: It's somewhat unclear to me why you'd prefer a score to a probability, especially since you can get the probability with minimal extra effort. All that said, most of the probability calculations seem like they're derived from the distance between the point and the hyperplane. If you look at Section 2 of the Platt paper, he walks through the motivation and says:

The class conditional densities between the margins are apparently exponential. Bayes' rule on two exponentials suggests using a parametric form of a sigmoid: $$ P(y=1 | f) = \frac{1}{1+\exp(Af+B)}$$ This sigmoid model is equivalent to assuming that the output of the SVM is proportional to the log-likelihood of a positive training example. [MK: $f$ was defined elsewhere to be the raw SVM output].
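To illustrate the shape of that sigmoid (the values of $A$ and $B$ here are invented, not fitted; $A$ is typically negative so that larger raw outputs $f$ map to higher probabilities):

```python
import numpy as np

def platt_sigmoid(f, A, B):
    """Platt's sigmoid: P(y=1 | f) = 1 / (1 + exp(A*f + B))."""
    return 1.0 / (1.0 + np.exp(A * f + B))

# Illustrative (not fitted) parameters: with A = -1, B = 0 the raw
# SVM outputs -2, 0, 2 map to roughly 0.12, 0.5, 0.88.
print(platt_sigmoid(np.array([-2.0, 0.0, 2.0]), A=-1.0, B=0.0))
```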

The rest of the method section describes how to fit the $A$ and $B$ parameters of that sigmoid. In the introduction (Sections 1.0 and 1.1), Platt reviews a few other approaches by Vapnik, Wahba, and Hastie & Tibshirani. These methods also use something like the distance to the hyperplane, manipulated in various ways. They all suggest that the distance to the hyperplane contains some useful information, so I guess you could use the raw distance as some (non-linear) measure of confidence.
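To give a feel for what fitting $A$ and $B$ amounts to, here is a simplified sketch that stands in for Platt's procedure: an ordinary logistic regression on the raw SVM outputs (Platt's actual algorithm adds regularized targets and the numerical safeguards discussed in the Lin, Lin, and Weng note):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Toy binary data, invented for illustration.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear").fit(X, y)
f = svm.decision_function(X).reshape(-1, 1)  # raw SVM outputs

# Logistic regression fits P(y=1|f) = 1 / (1 + exp(-(w*f + c)));
# matching Platt's form 1 / (1 + exp(A*f + B)) gives A = -w, B = -c.
calib = LogisticRegression().fit(f, y)
A, B = -calib.coef_[0][0], -calib.intercept_[0]
print("A =", A, "B =", B)
```

Note that Platt recommends fitting the sigmoid on data not used to train the SVM (a hold-out set or cross-validation) to avoid biased estimates of $A$ and $B$; the sketch above fits on the training data only for brevity.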