Machine Learning – What Are the Predicted Probabilities from an SVM?

I am using "train" in the caret package for binary classification with an SVM (the svmLinear2 method). I have set 'type = "prob"'. I understand that probability values farther from 0.5 mean the classification decision was 'easier', but what exactly do these scores mean? Are they derived from the distance from the hyperplane?

caret's svmLinear2 uses an improved implementation of Platt scaling to estimate the posterior $P(y=1|x)$. The improvements, mostly in terms of numerical stability and speed of convergence, are described in detail in Lin et al. (2007), "A note on Platt's probabilistic outputs for support vector machines". The resulting posterior estimates are effectively a rescaled version of the original classifier's scores, obtained through a logistic transformation. In that sense, the scores themselves are effectively treated as log-odds. Your understanding is correct: the SVM score is, up to a scaling by $\|w\|$, the signed distance from the point $X_i$ to the decision boundary, so the probabilities are indeed derived from the distance to the hyperplane.
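To make the logistic transformation concrete: Platt scaling fits two scalars $A$ and $B$ so that $P(y=1|x) = 1/(1+\exp(A f(x) + B))$, where $f(x)$ is the SVM decision value. Below is a minimal sketch of the idea. It is not the e1071/LIBSVM code: Python is used purely for illustration, the function names and toy data are made up, and plain gradient descent stands in for the Newton method with backtracking described by Lin et al. It does keep the regularised targets proposed by Platt and the numerically stable way of evaluating the sigmoid.

```python
import math

def platt_probability(f, A, B):
    """P(y=1 | f) = 1 / (1 + exp(A*f + B)), evaluated without overflow."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

def fit_platt(scores, labels, lr=0.01, n_iter=5000):
    """Fit A and B by gradient descent on the cross-entropy loss.

    Illustrative stand-in for the Newton-type solver used in practice.
    labels are 1 for the positive class and 0 otherwise.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Regularised targets instead of hard 0/1 labels.
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y == 1 else t_neg for y in labels]

    # Start from the class prior: A = 0, B = log((N- + 1) / (N+ + 1)).
    A, B = 0.0, math.log((n_neg + 1.0) / (n_pos + 1.0))
    for _ in range(n_iter):
        grad_A = grad_B = 0.0
        for f, t in zip(scores, targets):
            p = platt_probability(f, A, B)
            grad_A += f * (t - p)  # d(loss)/dA
            grad_B += (t - p)      # d(loss)/dB
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Toy decision values (signed scores) and their true classes.
scores = [-2.1, -1.3, -0.7, -0.2, 0.3, 0.4, 0.9, 1.5, 2.2]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

A, B = fit_platt(scores, labels)
for f in scores:
    print(f, round(platt_probability(f, A, B), 3))
```

With positive scores pointing to the positive class, the fitted $A$ comes out negative, so the estimated probability increases monotonically with the score: points far on the positive side of the hyperplane get probabilities near 1, points near the boundary get probabilities near 0.5.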

I think it is more appropriate to say that probabilities farther from 0.5 correspond to "more certain" classification decisions (rather than "easier" ones), but I accept this is partly a matter of style. In any case, a probability above 0.50 indicates that the point $X_i$ is predicted as $y=1$. Please note (again) that these posterior estimates rest on a substantial theoretical assumption: that the scores can be treated as log-odds, which is what makes the logistic transformation appropriate. My opinion is that if a method (like SVM classification here) is not designed to produce probabilistic estimates, one should be cautious about how its output is reinterpreted under a probabilistic lens. I would check the calibration plots of the resulting classifiers very carefully for inconsistencies. (See the function caret::calibration for how to do this through caret.)
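For intuition about what a calibration check computes, here is a rough sketch in Python (for illustration only, and not what caret::calibration does internally; the function name `reliability_table` and the toy numbers are made up). It bins the predicted probabilities and compares, per bin, the mean predicted probability with the observed fraction of positives. For a well-calibrated classifier the two should be close.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per-bin comparison of mean predicted probability and observed positive rate.

    probs are predicted P(y=1|x) in [0, 1]; labels are 1 (positive) or 0.
    Returns rows of (bin_low, bin_high, count, mean_predicted, fraction_positive).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean_p, frac_pos))
    return rows

# Toy predicted probabilities and true classes (hypothetical values).
probs = [0.05, 0.12, 0.33, 0.41, 0.48, 0.55, 0.62, 0.78, 0.91, 0.97]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

for low, high, n, mean_p, frac_pos in reliability_table(probs, labels):
    print(f"[{low:.1f}, {high:.1f}) n={n} predicted={mean_p:.2f} observed={frac_pos:.2f}")
```

Large, systematic gaps between the predicted and observed columns are the kind of inconsistency to look for in the calibration plot.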

Particular to the mechanics of the routines mentioned in the question: the Lin et al. version of the original Platt scaling is implemented internally in e1071::svm, which is an interface to LIBSVM. caret::train does not really do any independent computations here; caret's predict method simply returns the estimates of the posterior $P(y=1|x)$ as calculated by LIBSVM.

A more canonical reference on probability estimates from SVMs is Wu et al. (2004), "Probability estimates for multi-class classification by pairwise coupling".