Machine Learning – What Are the Predicted Probabilities from an SVM?

Tags: caret, machine learning, svm

I am using "train" in the caret package for binary classification with an SVM (method svmLinear2), and I have set 'type = "prob"'. I understand that probability values farther from 0.5 mean the classification decision was 'easier', but what exactly do these scores mean? Are they derived from the distance to the hyperplane?

Best Answer

caret's svmLinear2 uses an improved implementation of Platt scaling to obtain the posterior estimates $P(y=1|x)$. The improvements mostly concern numerical stability and speed of convergence; they are described in detail in Lin et al. (2007), A note on Platt's probabilistic outputs for support vector machines. The resulting posterior estimates are effectively the original classifier's scores rescaled through a logistic transformation. In that sense, the scores themselves are treated as (an affine function of) the log-odds. Your understanding is essentially correct: for a linear SVM, the score is, up to scaling by $\|w\|$, the signed distance from the point $X_i$ to the decision boundary.
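To make the logistic rescaling concrete, here is a minimal sketch in Python (standard library only) of the sigmoid used in Platt scaling, $P(y=1|x) \approx 1/(1+\exp(A f(x) + B))$, where $f(x)$ is the raw SVM score. The values of A and B below are made up for illustration; in practice they are fitted to the training scores. The branch on the sign of the exponent is in the spirit of the numerical-stability fixes discussed by Lin et al., but this is not the LIBSVM code itself.

```python
import math

def platt_probability(score, a, b):
    """Map a raw SVM score f(x) to P(y=1|x) = 1 / (1 + exp(a * f(x) + b))."""
    z = a * score + b
    # Branch on the sign of z so that exp() is only ever called on a
    # non-positive argument and cannot overflow.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Hypothetical fitted parameters (assumed for illustration only).
# A is typically negative, so larger scores give larger probabilities.
A, B = -2.0, 0.3

scores = [-3.0, -1.0, 0.0, 0.15, 1.0, 3.0]
probs = [platt_probability(s, A, B) for s in scores]

for s, p in zip(scores, probs):
    # The log-odds of the output is the affine function -(A * s + B) of the score.
    print(f"score={s:+.2f}  P(y=1|x)={p:.3f}  log-odds={log_odds(p):+.3f}")
```

Two things are worth noticing. The log-odds of the output is exactly $-(A f(x) + B)$, which is what "treating the score as log-odds" means here. Also, the probability crosses 0.5 at $f(x) = -B/A$, which is not necessarily at a score of zero.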

I think it is more appropriate to say that probabilities farther from 0.5 correspond to "more certain" classification decisions (rather than "easier" ones), but I accept this is partly a matter of style. In any case, a probability above 0.5 indicates that the point $X_i$ is predicted as $y=1$ under the fitted probability model; because the logistic transformation includes an offset, this can occasionally disagree with the label obtained from the sign of the raw score. Please note (again) that these posterior estimates rest on a substantial theoretical assumption: that the scores can be read as log-odds, so that a logistic transformation is appropriate. My opinion is that if a method (like SVM classification here) is not designed to produce probabilistic estimates, one should be cautious about reinterpreting its output through a probabilistic lens. I would check the calibration plots of the resulting classifiers very carefully for inconsistencies. (See the function caret::calibration for how to do this through caret.)
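For intuition about what a calibration plot compares, here is a rough sketch in Python (standard library only). It bins the predicted probabilities and, per bin, compares the mean predicted probability with the observed fraction of positives. The function name calibration_table and the toy data are invented for illustration; this is not how caret::calibration is implemented, only the idea behind the plot.

```python
def calibration_table(probs, labels, n_bins=5):
    """Bin predictions into equal-width probability bins and return, per
    non-empty bin: (lower edge, upper edge, count, mean predicted
    probability, observed fraction of positives)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than divide by zero
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_pred, frac_pos))
    return rows

# Toy predictions and true labels (1 = positive class), made up for illustration.
toy_probs = [0.05, 0.12, 0.18, 0.30, 0.35, 0.55, 0.58, 0.70, 0.75, 0.85, 0.90, 0.95]
toy_labels = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1]

rows = calibration_table(toy_probs, toy_labels)
for lo, hi, n, mean_pred, frac_pos in rows:
    print(f"[{lo:.1f}, {hi:.1f})  n={n}  mean predicted={mean_pred:.2f}  observed={frac_pos:.2f}")
```

For a well-calibrated classifier the last two columns should roughly agree in every bin; systematic gaps are the kind of inconsistency to look for. With real data you would want far more points per bin than in this toy example.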

Regarding the mechanics of the routines mentioned in the question: the Lin et al. version of the original Platt scaling is implemented in LIBSVM, which e1071::svm wraps (caret's svmLinear2 method calls e1071::svm). caret::train does not do any independent computation here; calling predict on the fitted caret model simply returns the estimates of the posterior $P(y=1|x)$ as calculated by LIBSVM.

A more canonical reference on probability estimates from SVMs is Wu et al. (2004), Probability estimates for multi-class classification by pairwise coupling.