Solved – Estimating SVM a posteriori probabilities with Platt's method does not always work

bayes, probability, svm

I have a problem.

I'm trying to create a multiclass SVM with probability output. The SVM itself works, meaning the accuracy is OK (see the last picture), but the probability estimation does not work yet.

I use one-vs-one multiclass encoding.
Now, as Platt (http://en.wikipedia.org/wiki/Platt_scaling) suggests, I fit a sigmoid to each binary model by minimizing the negative log likelihood

$$ \min_{A,B} \; -\sum_{i=1}^N \left[ t_i\log(p_i)+(1-t_i)\log(1-p_i)\right], $$ where $p_i = \frac{1}{1+\exp(Af_i+B)}$, $N$ is the number of samples and $f_i$ is the uncalibrated output (the score) of the SVM.
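
For illustration, here is a minimal sketch of that fit in MATLAB, using fminsearch instead of the Newton-type algorithm of Lin et al. and without its numerical safeguards; f and yBinary are placeholder names for the cross-validated scores and the ±1 labels, and the smoothed targets are the ones Platt proposes:

        % Sketch: fit the Platt sigmoid p = 1/(1+exp(A*f+B)) by minimizing
        % the negative log likelihood with fminsearch. f (CV scores) and
        % yBinary (labels in {-1,+1}) are placeholders.
        Nplus  = sum(yBinary ==  1);
        Nminus = sum(yBinary == -1);
        t = zeros(size(f));
        t(yBinary ==  1) = (Nplus + 1)/(Nplus + 2);    % Platt's smoothed targets
        t(yBinary == -1) = 1/(Nminus + 2);
        p   = @(ab) 1./(1 + exp(ab(1)*f + ab(2)));
        nll = @(ab) -sum( t.*log(p(ab)) + (1-t).*log(1 - p(ab)) );
        ab0 = [0; log((Nminus + 1)/(Nplus + 1))];      % Platt's suggested start
        ab  = fminsearch(nll, ab0);
        A = ab(1);  B = ab(2);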

The scores $f_i$ are leave-one-out cross-validation values, to prevent overfitting, but there is not much difference between the training-data scores, 10-fold CV and LOO CV.
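
For completeness, the cross-validated scores could be obtained roughly like this, assuming the Statistics Toolbox fitcsvm (my own setup may differ in details; X and yBinary are placeholder names):

        % Sketch: leave-one-out cross-validated decision values for one
        % binary (one-vs-one) subproblem, assuming fitcsvm from the
        % Statistics and Machine Learning Toolbox.
        mdl   = fitcsvm(X, yBinary, 'KernelFunction', 'rbf');
        cvmdl = crossval(mdl, 'Leaveout', 'on');       % leave-one-out CV
        [~, cvScore] = kfoldPredict(cvmdl);            % out-of-fold scores
        % column order follows mdl.ClassNames; take the score of class +1
        f = cvScore(:, mdl.ClassNames == 1);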

The improved algorithm by Lin et al. works well; I can check it against MATLAB's fminunc.

The problem: sometimes the sigmoid fits are poor, see the pictures below. There are three pairs of pictures. The first of each pair shows the feature space of the binary SVM (for the one-vs-one coding) with the uncalibrated output as contour lines.
The second picture has two subplots: the first is the probability density of the SVM output for each class (estimated via ksdensity in MATLAB); the second is supposed to be the posterior probability, calculated via Bayes' theorem. There might be an error in that calculation, so I write it down:

$$ p(y_1|f)=\frac{p(f|y_1)\,p(y_1)}{\sum_{j=1}^{\text{all classes}} p(f|y_j)\,p(y_j)}, $$ where $y_j$ is the $j$-th class and $f$ is again the SVM output.
I can calculate the prior $p(y_j)$ simply by counting the positive and negative samples, right? In this case it's $0.5$.

My problem is the likelihood. As far as I have understood, in the Bayesian case the likelihood is a probability density describing the distribution of the SVM output given the class, correct? I thought this is exactly what I'm showing in the first subplot, or am I wrong?
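
Written out for the binary case with counting priors, what I am computing is therefore (with $\hat p(f|y_{\pm 1})$ denoting the kernel density estimates):

$$ p(y_{+1}\mid f) = \frac{\hat p(f\mid y_{+1})\,\frac{N_+}{N_++N_-}}{\hat p(f\mid y_{+1})\,\frac{N_+}{N_++N_-} + \hat p(f\mid y_{-1})\,\frac{N_-}{N_++N_-}} $$
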
My code follows:

        % kernel density estimates of p(f | y = +1) and p(f | y = -1)
        % on a fixed grid xi of SVM scores
        [pPlus,  xi] = ksdensity(score(yBinary ==  1), [-1.5:.0001:1.5]);
        [pMinus, xi] = ksdensity(score(yBinary == -1), [-1.5:.0001:1.5]);

        % evidence p(f) and posteriors via Bayes' theorem,
        % with priors estimated from the class counts
        evid      = pPlus  * Nplus /(Nplus+Nminus) + pMinus * Nminus/(Nplus+Nminus);
        postPlus  = pPlus  * Nplus /(Nplus+Nminus) ./ evid;   % likelihood * prior / evidence
        postMinus = pMinus * Nminus/(Nplus+Nminus) ./ evid;

Nplus is the number of positive samples, Nminus the number of negative ones. score is the SVM output, xi is simply the grid on the x-axis, and pPlus is the likelihood of the positive class.
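
To compare the two approaches on the same axis, the fitted Platt sigmoid could simply be evaluated on the same grid and overlaid on the KDE-based posterior (A and B being the sigmoid parameters fitted above; again just a sketch):

        % Sketch: overlay the fitted Platt sigmoid on the KDE-based posterior.
        plattPost = 1./(1 + exp(A*xi + B));    % P(y = +1 | f) according to Platt
        plot(xi, postPlus, xi, plattPost);
        legend('KDE/Bayes posterior', 'Platt sigmoid');
        xlabel('SVM score f');  ylabel('P(y = +1 | f)');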

Is there an error in the formulas, or do you need further code/data?
I ask because there are some oscillations in the second picture, and in the 4th picture the sigmoid doesn't fit at all, whereas the sigmoid in the 6th picture is OK.

Are the subplot pictures plausible, or might there be an error? Or is the method just not working? As I said above, the optimization works and gives different values for each sigmoid (though it might be impossible to see here), and fminunc and the proposed algorithm produce the same values.

By the way, the last picture shows the predicted classes of the multiclass SVM; maybe it helps you interpret the pictures of the binary SVMs.

If you have any ideas, you're welcome! Thank you very much.

[Figure] Binary SVM class 1 vs 2 with decision values
[Figure] Binary SVM class 1 vs 3 with decision values
[Figure] Binary SVM class 1 vs 4 with decision values
[Figure] Feature space result of the multiclass SVM

Best Answer

This response is obviously way outdated, but there is a known issue with Platt scaling. I would not be surprised if SciPy and MATLAB use similar algorithms to infer probabilities from SVM output. The accuracy of the output depends on the sample size and, in an unpredictable manner, the estimated probabilities may be completely dubious. Check out this thread on stackoverflow.
