Solved – Why is the Brier Score better when probabilities are estimated through PAVA instead of Platt Scaling

calibration, r, scoring-rules, svm

I've been studying (and applying) SVMs for some time now, mostly through kernlab in R.

kernlab allows probabilistic estimation of the outcomes through Platt Scaling, but the same can be achieved with Pool Adjacent Violators (PAV) isotonic regression (Zadrozny and Elkan, 2002).
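For anyone unfamiliar with PAV: it fits the closest monotone (stair-step) function to the 0/1 outcomes ordered by the classifier score. Here is a minimal toy sketch using isotone's gpava, the same function as in the code below; the scores and labels are made up for illustration:

require(isotone)

score <- c(-2, -1, 0, 1, 2)  #hypothetical decision values, already sorted
label <- c(0, 1, 0, 1, 1)    #0/1 outcomes; the middle (1, 0) pair violates monotonicity
fit   <- gpava(score, label) #PAV pools the violating pair into their mean
fit$x                        #fitted probabilities: 0.0 0.5 0.5 1.0 1.0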

I've been wrapping my head around this and came up with some (clunky, but working, or so I think) code to try out the PAV algorithm.

I divided the task into three pairwise binary classification tasks, estimated the probabilities on the training data, and coupled the pairwise probabilities to get class probabilities (Wu, Lin, and Weng, 2004).
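As a quick illustration of the coupling step, kernlab's couple() takes one row of pairwise probabilities per instance; for three classes I'm assuming the column order (1 vs 2, 1 vs 3, 2 vs 3), which is also what my code below relies on. The numbers here are invented:

require(kernlab)

#one instance: P(setosa | set/ver), P(setosa | set/vir), P(versicolor | ver/vir)
pairwise <- matrix(c(0.9, 0.8, 0.3), nrow = 1)
couple(pairwise)  #one row of three class probabilities summing to 1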

Predictions were made on the training set. I set the cost really low (C = 0.001) to try to get some misclassifications.

The Brier Score is defined as:

$$BS=\frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti}-o_{ti})^2$$

where $R$ is the number of classes, $N$ is the number of instances, $f_{ti}$ is the forecast probability of the $t$-th instance belonging to the $i$-th class, and $o_{ti}$ is $1$ if the actual class $y_t$ equals $i$ and $0$ otherwise.
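The formula translates directly into a few lines of R. The brier() helper below is a hypothetical function of my own, not from any package; f is an N x R matrix of forecast probabilities and y is a factor of true classes matching the columns of f:

brier <- function(f, y) {
    o <- model.matrix(~ y - 1)  #N x R one-hot matrix of observed outcomes
    sum((f - o)^2) / nrow(f)    #summed over classes, averaged over instances
}

With the objects defined below, brier(PRED, iris$Species) should reproduce BRIER.PLATT.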

require(isotone)
require(kernlab)

##PAVA SET/VER
data1   <-  iris[1:100,]        #only setosa and versicolor
MR1 <-  c(rep(0,50),rep(1,100))     #target labels: 0 = setosa, 1 = everything else
KSVM1   <-  ksvm(Species~., data=data1, type="C-svc", kernel="rbfdot", C=.001)
PRED1   <-  predict(KSVM1,iris, type="decision")    #SVM decision function
PAVA1   <-  gpava(PRED1, MR1)               #generalized pool adjacent violators algorithm 

##PAVA SET/VIR
data2   <-  iris[c(1:50,101:150),]      #only setosa and virginica
MR2 <-  c(rep(0,50),rep(1,50),rep(0,50))    #target labels (0/1)
KSVM2   <-  ksvm(Species~., data=data2, type="C-svc", kernel="rbfdot", C=.001)
PRED2   <-  predict(KSVM2,iris, type="decision")
PAVA2   <-  gpava(PRED2, MR2)

##PAVA VER/VIR
data3   <-  iris[51:150,]   #only versicolor and virginica
MR3 <-  c(rep(0,100),rep(1,50)) #target labels (0/1)
KSVM3   <-  ksvm(Species~., data=data3, type="C-svc", kernel="rbfdot", C=.001)
PRED3   <-  predict(KSVM3,iris, type="decision")
PAVA3   <-  gpava(PRED3, MR3)

#Usual pairwise binary SVM
KSVM    <-  ksvm(Species~.,data=iris, type="C-svc", kernel="rbfdot", C=.001,prob.model=TRUE)

#probabilities on the training data through Platt scaling and pairwise coupling
PRED    <-  predict(KSVM,iris,type="probabilities")

#The usual KSVM response based on the sign of the decision function
RES <-  predict(KSVM,iris)

#pairwise probabilities coupling algorithm on kernlab
PROBS   <-  kernlab::couple(cbind(1-PAVA1$x,1-PAVA2$x,1-PAVA3$x))
colnames(PROBS) <- c("setosa","versicolor","virginica")

#Brier score, multiclass definition (PAVA probabilities)
BRIER.PAVA  <-  sum(
    (cbind(rep(1,50),rep(0,50),rep(0,50))-PROBS[1:50,])^2,
    (cbind(rep(0,50),rep(1,50),rep(0,50))-PROBS[51:100,])^2,
    (cbind(rep(0,50),rep(0,50),rep(1,50))-PROBS[101:150,])^2)/150

#Brier score, multiclass definition (Platt probabilities)
BRIER.PLATT <-  sum(
    (cbind(rep(1,50),rep(0,50),rep(0,50))-PRED[1:50,])^2,
    (cbind(rep(0,50),rep(1,50),rep(0,50))-PRED[51:100,])^2,
    (cbind(rep(0,50),rep(0,50),rep(1,50))-PRED[101:150,])^2)/150

BRIER.PAVA

BRIER.PLATT

Soon I'll clean this up a bit and write a proper wrapper function to do it all (a rough sketch is further below), but this result is really worrying to me.

BRIER.PAVA 
[1] 0.09801759
BRIER.PLATT 
[1] 0.6710232

The Brier Score I get from the probabilities estimated through PAVA is way better than the one from Platt Scaling.

If you check PRED you will see all probabilities fall around ~0.33, while in PROBS much more extreme values (close to 0 or 1) show up, which was quite unexpected to me given the really low C.
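For reference, here is the rough shape of the wrapper I have in mind. pavaProbSVM is a made-up name, the response column is hard-wired to Species for brevity, and it keeps the same in-sample calibration as above (gpava is fitted on the predictions for newdata itself). Note it uses one consistent 0/1 convention (0 for the pair's first class, 1 for everything else, as in MR1); MR2 and MR3 above follow different conventions, which may be worth double-checking:

require(isotone)
require(kernlab)

pavaProbSVM <- function(formula, data, classes, C = 0.001, newdata = data) {
    pairs <- combn(classes, 2, simplify = FALSE)    #(1,2), (1,3), (2,3) order
    pairProbs <- sapply(pairs, function(p) {
        sub <- data[data$Species %in% p, ]
        sub$Species <- droplevels(sub$Species)
        m <- ksvm(formula, data = sub, type = "C-svc", kernel = "rbfdot", C = C)
        d <- as.vector(predict(m, newdata, type = "decision"))
        y <- as.numeric(newdata$Species != p[1])    #0 = first class of the pair
        1 - gpava(d, y)$x                           #calibrated P(first class)
    })
    probs <- couple(pairProbs)
    colnames(probs) <- classes
    probs
}

#e.g. pavaProbSVM(Species ~ ., iris, levels(iris$Species))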

References:

Zadrozny, B., and Elkan, C. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.

Wu, T.-F., Lin, C.-J., and Weng, R. C. "Probability estimates for multi-class classification by pairwise coupling." Journal of Machine Learning Research 5 (2004): 975-1005.

EDIT:

Also, if you check the AUCs of the different probabilities, they are quite high.

require(caTools)

AUC.PAVA  <- caTools::colAUC(PROBS, iris$Species)
AUC.PLATT <- caTools::colAUC(PRED, iris$Species)

colMeans(AUC.PAVA)
colMeans(AUC.PLATT)

And here are the results:

> colMeans(AUC.PAVA)
    setosa versicolor  virginica 
 0.9988667  0.9988667  0.8455333 
> colMeans(AUC.PLATT)
    setosa versicolor  virginica 
 0.8913333  0.8626667  0.9656000 

Looking at these AUCs, I would say Platt Scaling is a really underconfident technique here.

Best Answer

Isotonic regression tends to overfit on small data, while Platt Scaling is much more constrained (it is just a two-parameter logistic fit). On large data the two converge (I tested this on large simulated data).

Since my example above trains and tests on the same data, the isotonic fit is obviously overfitted.
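One way to see this, as a sketch under my own assumptions: fit both calibrations on half of a single setosa/versicolor task and score the Brier terms on the held-out half. Since gpava has no predict method that I know of, the isotonic fit is carried to unseen decision values with a step-function interpolation via approxfun; the split and seed are arbitrary:

require(kernlab)
require(isotone)

set.seed(1)
sv    <- droplevels(iris[1:100,])   #setosa and versicolor only
idx   <- sample(100, 50)
train <- sv[idx,]
test  <- sv[-idx,]
yTr   <- as.numeric(train$Species == "versicolor")
yTe   <- as.numeric(test$Species == "versicolor")

M   <- ksvm(Species~., data=train, type="C-svc", kernel="rbfdot", C=.001, prob.model=TRUE)  #Platt fitted on train only
dTr <- as.vector(predict(M, train, type="decision"))
dTe <- as.vector(predict(M, test, type="decision"))

ISO  <- gpava(dTr, yTr)  #PAVA fitted on train only
o    <- order(ISO$z)     #carry the monotone step fit to unseen scores
STEP <- approxfun(ISO$z[o], ISO$x[o], method="constant", rule=2, ties=mean)

pPAVA  <- STEP(dTe)
pPLATT <- predict(M, test, type="probabilities")[,"versicolor"]

mean((pPAVA-yTe)^2)   #held-out Brier term, PAVA
mean((pPLATT-yTe)^2)  #held-out Brier term, Platt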
