Let me first answer your question in general. The SVM is not a probabilistic model. One reason is that it does not correspond to a normalizable likelihood. For example, in regularized least squares you have the loss function $\sum_i \|y_i - \langle w, x_i\rangle - b\|_2^2$ and the regularizer $\|w\|_2^2$. The weight vector is obtained by minimizing the sum of the two. However, this is equivalent to maximizing the log-posterior of $w$ given the data, $p(w|(y_1,x_1),\ldots,(y_m,x_m)) = \frac{1}{Z} \exp(-\|w\|_2^2)\prod_i \exp(-\|y_i - \langle w, x_i\rangle - b\|_2^2)$, which you can see is the product of a Gaussian likelihood and a Gaussian prior on $w$ ($Z$ makes sure that it normalizes). You get to the Gaussian likelihood from the loss function by flipping its sign and exponentiating it. However, if you do that with the loss function of the SVM, the result is not a normalizable likelihood, so you do not get a proper probabilistic model.
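To spell the correspondence out with the same notation (the hinge-loss expression below is the standard soft-margin SVM loss, added here only for completeness):

$$
-\log p(w \mid (y_1,x_1),\ldots,(y_m,x_m)) \;=\; \sum_i \|y_i - \langle w, x_i\rangle - b\|_2^2 \;+\; \|w\|_2^2 \;+\; \log Z,
$$

so minimizing "loss plus regularizer" and maximizing the posterior are the same thing. For the SVM you would have to negate and exponentiate the hinge loss $\max(0,\, 1 - y_i(\langle w, x_i\rangle + b))$ instead, and, as noted above, the resulting expression cannot be normalized into a proper likelihood.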
There have been attempts to turn the SVM into one. The most notable, which is, I think, also implemented in libsvm, is:
John Platt: Probabilistic outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods (NIPS 1999): http://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
To answer your question more specifically: the idea in SVMs is indeed that the further a test vector is from the hyperplane, the more it belongs to a certain class (except when it is on the wrong side, of course). In that sense, support vectors do not belong to their class with high probability, because they are either the points closest to the hyperplane or the ones on its wrong side. The $\alpha$ value that you get from libsvm has nothing to do with the $\alpha_i$ in the decision function. It is rather the output of the decision function $\sum_{i \in SV}\alpha_i k(x,x_i) + b$ (and should therefore properly be called $y$). Since $y = \sum_{i \in SV}\alpha_i k(x,x_i) + b = \langle w, \phi(x) \rangle_{\mathcal H} + b$, where $w$ lives in the reproducing kernel Hilbert space, $y$ is proportional to the signed distance to the hyperplane. It would be the signed distance itself if you divided by the norm of $w$, which in kernel terms is $\|w\|_{\mathcal H} = \sqrt{\sum_{i,j\in SV} \alpha_i \alpha_j k(x_i,x_j)}$.
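To make that concrete, here is a rough R sketch of how you could recover the signed distance from the decision values, using `e1071` (an R wrapper around LIBSVM) with a linear kernel on a binary subset of `iris`; the package, data and kernel here are only illustrative choices, not something from your setup:

```r
# Sketch: turn the raw decision value y = sum_i alpha_i k(x, x_i) + b into a
# signed distance by dividing by ||w||_H = sqrt(alpha' K alpha).
library(e1071)

dat <- droplevels(iris[1:100, ])                 # binary toy problem: setosa vs versicolor
fit <- svm(Species ~ ., data = dat, kernel = "linear", cost = 1)

# raw decision values f(x) for the training points
f <- attr(predict(fit, dat, decision.values = TRUE), "decision.values")

alpha <- fit$coefs                               # alpha_i (times the labels) for the SVs
SV    <- as.matrix(fit$SV)                       # the (scaled) support vectors

K      <- SV %*% t(SV)                           # linear kernel matrix over the SVs
w_norm <- sqrt(drop(t(alpha) %*% K %*% alpha))   # ||w||_H

# distance to the hyperplane (in e1071's internally scaled feature space)
signed_distance <- f / w_norm
```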
I am not an R user, but I suspect it is because you are using the soft-margin support vector machine (which is what I presume "C-svc" means). The support vectors will only lie exactly on the margins for the hard margin SVM (where C is infinite). Essentially the C parameter penalises the degree to which the support vectors are allowed to violate the margin constraint, so if C is less than infinity, the support vectors are allowed to drift away from the margins in the interests of making the margin broader, which often leads to better generalisation.
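If you want to see this effect directly, here is a toy sketch (assuming `kernlab`, since "C-svc" is `ksvm`'s terminology, and made-up linearly separable data): with a huge `C` the support vectors' decision values are close to $\pm 1$, i.e. they sit on the margins, while with a small `C` many of them end up inside the margin.

```r
# Compare the decision values of the support vectors for a nearly hard-margin
# fit (very large C) and a soft-margin fit (small C).
library(kernlab)

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "A", "B"))   # separable toy data

for (C in c(1e5, 0.1)) {
  fit <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = C)
  f   <- predict(fit, x[SVindex(fit), , drop = FALSE], type = "decision")
  cat("C =", C, "-> |decision values| of the support vectors:\n")
  print(round(abs(as.vector(f)), 2))
}
```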
caret's `svmLinear2` uses an improved implementation of Platt scaling for the posterior estimates $P(y=1|x)$. The improvements are mostly in terms of numerical stability and speed of convergence; they are described in detail in Lin et al. (2007), "A note on Platt's probabilistic outputs for support vector machines". The posterior estimates produced are effectively a rescaled version of the original classifier's scores through a logistic transformation; in that sense, the scores themselves are treated as log-odds ratios. Your understanding is correct: strictly speaking, the SVM score is the distance from the point $X_i$ to the decision boundary. I think it is more appropriate to say that the returned probabilities correspond to "more certain" classification decisions (rather than "easier"), but I accept this is a bit stylistic too. In any case, a probability above 0.50 indicates that the point $X_i$ is predicted as $y=1$. Please note (again) that these posterior estimates come with the substantial theoretical caveat that the scores are assumed to behave like log-odds ratios, which is what makes the logistic transformation relevant. My opinion is that if a method (like SVM classification here) is not designed to produce probabilistic estimates, one should be cautious about how its output is reinterpreted under a probabilistic lens. I would check the calibration plots of the resulting classifiers very carefully for inconsistencies. (See the function `caret::calibration` for how to do this through caret; a sketch follows below.)
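A rough sketch of such a check; the simulated data, the resampling settings and the `svmLinear2` tag are just placeholders for whatever you actually fitted:

```r
# Fit an SVM via caret with class probabilities turned on, then inspect the
# calibration of the cross-validated posterior estimates.
library(caret)

set.seed(1)
dat <- twoClassSim(500)                       # simulated two-class data from caret

fit <- train(Class ~ ., data = dat,
             method    = "svmLinear2",        # wraps e1071::svm
             trControl = trainControl(method          = "cv",
                                      classProbs      = TRUE,
                                      savePredictions = "final"))

# calibration plot of the held-out estimates of P(y = "Class1" | x)
cal <- calibration(obs ~ Class1, data = fit$pred, class = "Class1")
plot(cal)
```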
Particular to the mechanics of the routines mentioned in the question: the Lin et al. version of the original Platt scaling is implemented internally in `e1071::svm`. `caret::train` does not really do any independent computations, and `caret`'s `predict` method simply returns the estimates of the posterior $P(y=1|x)$ as calculated by LIBSVM. A more canonical reference on probability estimates from SVMs is Wu et al. (2004), "Probability estimates for multi-class classification by pairwise coupling".
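For reference, the same LIBSVM machinery can be called directly through `e1071`; a minimal sketch (the binary `iris` subset is only for illustration):

```r
# probability = TRUE makes LIBSVM fit the Lin et al. / Platt sigmoid on top of
# the decision values; predict() then exposes both the raw scores and the
# resulting posterior estimates.
library(e1071)

dat <- droplevels(iris[1:100, ])               # setosa vs versicolor
fit <- svm(Species ~ ., data = dat, probability = TRUE)

pred <- predict(fit, dat, decision.values = TRUE, probability = TRUE)
head(attr(pred, "decision.values"))            # raw SVM scores f(x)
head(attr(pred, "probabilities"))              # Platt/Lin-scaled P(y | x)
```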