Solved – Isn’t caret SVM classification wrong when class probabilities are included

caret, classification, r

Please note: this question is about the Platt probabilistic output and SVM class assignment, not about the code or the package itself. It just happens to be the code where I stumbled on the issue.

In another question I asked about bad models coming from caret and the associated kernlab when prob.model = TRUE. I found the answer myself, both on Stack Overflow and from Max Kuhn himself:

> predict(newSVM, df[43,-1])
[1] O32078
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
> predict(newSVM, df[43,-1], type = "probabilities")
     O27479     O31403     O32057    O32059    O32060     O32078
[1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
     O32089     O32663     O32668     O32676
[1,] 0.04890477 0.05210836 0.09838892 0.07284396

Note that, based on the probability model, the class with the largest
probability is O32057 (p = 0.24) while the basic SVM model predicts
O32078 (p = 0.16).
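
One quick way to see such disagreements programmatically (a minimal sketch; newSVM and df are the objects from the quoted answer, so this is only illustrative):

hard  <- predict(newSVM, df[, -1])                          # hard class labels
probs <- predict(newSVM, df[, -1], type = "probabilities")  # Platt probabilities
soft  <- colnames(probs)[max.col(probs)]                    # argmax-probability class
which(as.character(hard) != soft)                           # rows such as 43 disagree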

Somebody (maybe me) saw this discrepancy, and that led me to follow this rule:

if(prob.model = TRUE) use the class with the maximum probability   
  else use the class prediction from ksvm().
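
In R, that rule looks roughly like this (just a sketch; predict_class, model and newdata are my own hypothetical names, not caret's API):

predict_class <- function(model, newdata, prob.model = TRUE) {
  if (prob.model) {
    # prob.model = TRUE: take the class with the maximum Platt probability
    probs <- predict(model, newdata, type = "probabilities")
    factor(colnames(probs)[max.col(probs)], levels = colnames(probs))
  } else {
    # otherwise keep the hard class prediction from ksvm()
    predict(model, newdata)
  }
}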

Therefore:

predict(svm.m1, df[43,-1])
 [1] O32057
 10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676

Isn't that inaccurate? kernlab searches for the optimal probability cutoff that minimizes error; that's why the assigned class and the class with the maximum probability don't match: they don't have to.
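
One way to poke at this is to look at the other outputs predict() exposes for a ksvm object; the hard labels, the pairwise decision values and votes, and the coupled probabilities can all be requested (a sketch using a default rbfdot fit, not the model from the question):

library(kernlab)
fit     <- ksvm(Species ~ ., data = iris, prob.model = TRUE)  # default rbfdot kernel
newdata <- iris[c(135, 150), -5]
predict(fit, newdata)                           # hard class labels
predict(fit, newdata, type = "probabilities")   # coupled Platt probabilities
predict(fit, newdata, type = "decision")        # pairwise decision values
predict(fit, newdata, type = "votes")           # pairwise vote counts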

Check this reproducible example. I excluded two cherry-picked virginica samples.

require(kernlab); require(caret)
# kernel = polynomial; degree = 3; scale = 0.1; C = 0.31
set.seed(101)
SVM <- ksvm(Species ~ ., data = iris[-c(135, 150), ], kernel = 'polydot',
            C = 0.31, kpar = list(scale = 0.1, degree = 3), prob.model = TRUE)

Here's the resulting model

> SVM
Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 0.31 

Polynomial kernel function. 
 Hyperparameters : degree =  3  scale =  0.1  offset =  1 

Number of Support Vectors : 58 

Objective Function Value : -1.4591 -0.7955 -10.2392 
Training error : 0.033784 
Probability model included. 

Now let's check the predicted class probabilities for those two samples

> predict(SVM, iris[c(135,150),-5], type="probabilities")
          setosa versicolor virginica
[1,] 0.008286638  0.4414114  0.550302
[2,] 0.013824451  0.3035556  0.682620

And the class predictions

> predict(SVM, iris[c(135,150),-5])
[1] versicolor virginica 
Levels: setosa versicolor virginica

Sample 150 was assigned to virginica with a class probability of around 0.68. Sample 135, however, was assigned to versicolor with a probability of around 0.44, even though its virginica probability sits at around 0.55.
Looking at several CV folds, it appears that kernlab only assigns virginica when its probability is above a certain value (well above 0.5). That's the cutoff I mentioned, and it arises because of the well-known overlap between versicolor and virginica in iris.
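
A rough way to check that across folds (a sketch, not the exact experiment; it simply reuses the hyperparameters above and records the virginica probability whenever the two rules disagree):

set.seed(101)
folds <- caret::createFolds(iris$Species, k = 5)   # held-out indices per fold
disagree <- do.call(rbind, lapply(folds, function(idx) {
  fit   <- ksvm(Species ~ ., data = iris[-idx, ], kernel = 'polydot', C = 0.31,
                kpar = list(scale = 0.1, degree = 3), prob.model = TRUE)
  hard  <- as.character(predict(fit, iris[idx, -5]))
  probs <- predict(fit, iris[idx, -5], type = "probabilities")
  soft  <- colnames(probs)[max.col(probs)]
  data.frame(hard, soft, p.virginica = probs[, "virginica"])[hard != soft, ]
}))
disagree   # samples where the hard label and the argmax-probability class differ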

So, am I right about these suppositions, and is caret's class-assignment rule (maximum probability) therefore wrong?

EDIT:
I've been experimenting with pairwise probability coupling of Platt scaling (a logistic regression fit to the decision values), isotonic regression, and a model I'm working on. A weakness (?) I perceive in Platt's model is that the probability isn't bound to be 0.5 when the binary SVM decision value is 0, which would be the expected result, since the instance would lie exactly on the separating hyperplane.
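
To make that concrete: Platt scaling fits P(y = 1 | f) = 1 / (1 + exp(A*f + B)) to the decision values f, so at f = 0 the probability is 1 / (1 + exp(B)), which equals 0.5 only when the fitted intercept B is 0 (the A and B values below are made up purely for illustration):

platt <- function(f, A, B) 1 / (1 + exp(A * f + B))
platt(0, A = -2, B = 0)     # 0.5: only when the fitted intercept is exactly 0
platt(0, A = -2, B = 0.8)   # ~0.31: a point on the hyperplane no longer maps to 0.5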

Best Answer

After one more year of learning, I've come to the conclusion that it isn't wrong per se, but it is debatable; from the caret perspective, I don't think the package should change the outputs of the underlying learners. Still, some people might be confused by that kind of behavior: to minimize risk, you would always output the class with the highest probability. The thing is, those probabilities are estimates and should be treated as such.

It's a matter of opinion, and it arises from the unnecessary dichotomization of the results. I actually noticed it while trying to ditch accuracy in favour of AUC.