Solved – TOO low estimated SVM probability for most of the negative test examples

classification, libsvm, machine learning, MATLAB, svm

I am using LIBSVM (as well as fitcsvm and fitSVMPosterior in Matlab) to train SVM models and obtain probability estimates. I noticed that the estimated probabilities for the vast majority of negative test examples are too low (e.g. < 0.01). I am puzzled as to what could explain this.

One fact that may provide an explanation is that there is good reason to believe that some negative training examples are in fact positive ones. This would push the classification boundary away from the negative examples and hence lead to the very low estimated probabilities. The relatively low recall achieved also points in this direction.

[Note that the test datasets are more accurate and have less label noise].

Does this explanation make sense? If not, what could be the explanation?

Best Answer

The first thing you want to do is look at the outputs (scores) of your trained SVM, not the posterior probabilities. What happens is that fitSVMPosterior tries to fit a sigmoid through the scores to generate the posterior probabilities. Plot the scores versus the class labels: if the scores do not seem to follow a sigmoid-like curve, you know you are in trouble, since fitSVMPosterior will not be able to fit them. The best way to evaluate this is to train your classifier on a training set and plot the predicted scores on a test set versus their class labels.
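A minimal sketch of that diagnostic, assuming you already have Xtrain/ytrain/Xtest/ytest splits, numeric class labels, and a Gaussian kernel (all of these names are placeholders for your own data):

```matlab
% Train an SVM and look at its raw scores (not posteriors) on held-out data.
mdl = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'rbf', 'Standardize', true);

% The second output of predict is the score matrix; column 2 corresponds to
% the second class in mdl.ClassNames (check the ordering for your labels).
[~, scoreTest] = predict(mdl, Xtest);

% Plot raw scores against the true labels: if the label does not look roughly
% sigmoidal as a function of the score, fitSVMPosterior will struggle.
figure;
gscatter(scoreTest(:, 2), ytest, ytest);
xlabel('SVM score'); ylabel('true class label');
```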

Furthermore, you mention you use oversampling to address the imbalance issue. You can also try using weights to train your SVM instead. Apparently Matlab's behavior is to set the weights in such a way that they sum up to the prior probabilities (see here). So definitely try using just a regular sample of your data as well when you evaluate your posteriors.
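As a rough sketch of the weighting alternative (variable names are placeholders, the positive class is assumed to be labeled 1, and whether observation weights or a cost matrix works better depends on your data):

```matlab
% Up-weight the minority (positive) class instead of oversampling it.
nPos = sum(ytrain == 1);
nNeg = sum(ytrain ~= 1);
w = ones(size(ytrain));
w(ytrain == 1) = nNeg / nPos;          % each positive counts as nNeg/nPos observations

mdlW = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'rbf', ...
               'Standardize', true, 'Weights', w);

% A misclassification cost matrix is another route to a similar effect:
% mdlC = fitcsvm(Xtrain, ytrain, 'Cost', [0 1; nNeg/nPos 0]);
```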

What could be happening is that the SVM model you train is evaluated using accuracy during cross validation. Furthermore, the SVM itself uses the hinge loss. Neither of these performance measures says anything about the quality of the posteriors. So the problem is that the model tries to maximize the accuracy, and because of this the posterior quality might not be good. I'm going out on a limb here, but I'm going to assume you want a model that outputs proper posterior probabilities, so in my answer I'll detail how to get that.

There are four options: (1) if you really prefer the SVM model, you can change the cross-validation procedure so that it measures the quality of the posteriors, and select hyperparameters using that measure to obtain better posteriors. (2) If you plotted the scores versus the class labels and did see a pattern that could be calibrated somehow, but not with a sigmoid, you can write your own function to fit the posteriors using a different model, though this could be a lot of work. (3) You could use a kernelized penalized logistic regression model, which directly optimizes the quality of the posteriors during training (my recommended solution). (4) You could use a Gaussian process classification model, but these are quite hard to train in practice.

(1) To do this you would: train your model with some hyperparameters (cost, and the sigma of the kernel if you use a Gaussian kernel) on the training fold, fit the SVM posterior model on the training fold, and predict the posteriors on the test fold. Then compute the log likelihood on the test fold using the posteriors and the class labels. Repeat this for all folds, and choose the hyperparameters that give the best log likelihood. Why? The log likelihood measures the quality of the posteriors, so if it is optimized by cross validation, you will possibly get better posteriors. However, it might be the case that this does not work very well, since the SVM itself does not aim to give accurate posteriors.
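A rough sketch of that procedure, assuming binary labels in y with positive class 1, a Gaussian kernel, and made-up hyperparameter grids (boxGrid, scaleGrid):

```matlab
boxGrid   = [0.1 1 10];          % hypothetical grid for the box constraint (cost)
scaleGrid = [0.5 1 2];           % hypothetical grid for the kernel scale (sigma)
cvp = cvpartition(y, 'KFold', 5);
bestLL = -Inf;

for C = boxGrid
    for s = scaleGrid
        ll = 0;
        for k = 1:cvp.NumTestSets
            tr = training(cvp, k);  te = test(cvp, k);
            mdl = fitcsvm(X(tr,:), y(tr), 'KernelFunction', 'rbf', ...
                          'BoxConstraint', C, 'KernelScale', s);
            mdl = fitSVMPosterior(mdl);          % fit Platt's sigmoid on this fold
            [~, post] = predict(mdl, X(te,:));   % after fitSVMPosterior, scores are posteriors
            p   = min(max(post(:, 2), eps), 1 - eps);   % column 2 = positive class; avoid log(0)
            yte = (y(te) == 1);
            ll  = ll + sum(yte .* log(p) + (1 - yte) .* log(1 - p));
        end
        if ll > bestLL
            bestLL = ll;  bestC = C;  bestSigma = s;
        end
    end
end
```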

(2) Take a look at the Platt scaling that is used by fitSVMPosterior. What you can do instead of a logistic transformation is binning: bin the scores and compute the posterior for each bin. You can find some details here. Possibly this will give you better results, but it will likely be a pain to implement and is not used that often...
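A quick sketch of that binning (histogram calibration) idea; the number of bins and the names scoreTrain, yTrain, scoreNew are placeholders, and the calibration scores/labels should come from data not used to train the SVM:

```matlab
nBins = 10;
edges = linspace(min(scoreTrain), max(scoreTrain), nBins + 1);
edges(1) = -Inf;  edges(end) = Inf;                 % catch scores outside the range
binIdx = discretize(scoreTrain, edges);

% Posterior of the positive class per bin = fraction of positives in that bin.
binPosterior = accumarray(binIdx, double(yTrain == 1), [nBins 1], @mean);

% Calibrated posterior for new scores: look up the bin each score falls into.
postNew = binPosterior(discretize(scoreNew, edges));
```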

(3) Penalized kernel logistic regression is similar to the SVM: it uses regularization (which corresponds to the cost parameter of the SVM) and can be used with kernels, like the SVM model. Mark Schmidt has a nice Matlab implementation here; take a look at the file minFunc_examples.m and search for "Kernel logistic regression". This model performs quite well for classification in terms of accuracy and can be used to get proper probability estimates.
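This is not Mark Schmidt's code, just a toy sketch of penalized kernel logistic regression trained with plain gradient descent, so you can see the idea; sigma, lambda, the step size, and the iteration count are arbitrary, and minFunc will optimize this far more reliably:

```matlab
sigma  = 1;        % Gaussian kernel width
lambda = 1e-2;     % regularization strength (plays the role of 1/cost)
K   = exp(-pdist2(Xtrain, Xtrain).^2 / (2 * sigma^2));   % n-by-n kernel matrix
ypm = 2 * (ytrain == 1) - 1;                             % labels in {-1, +1}

% Objective: sum_i log(1 + exp(-y_i * f_i)) + (lambda/2) * alpha'*K*alpha,  f = K*alpha
alpha = zeros(size(K, 1), 1);
eta   = 1e-3;                                            % fixed step size
for it = 1:5000
    f = K * alpha;                                       % current model outputs
    g = -K * (ypm ./ (1 + exp(ypm .* f))) + lambda * (K * alpha);   % gradient
    alpha = alpha - eta * g;
end

% Posterior probability of the positive class for new points Xnew.
Knew = exp(-pdist2(Xnew, Xtrain).^2 / (2 * sigma^2));
pPos = 1 ./ (1 + exp(-Knew * alpha));
```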

(4) Gaussian processes naturally compute posterior probabilities. If you want to know more, I definitely recommend reading this free book. The website also contains code samples you can use (though it will take quite some reading to figure out what to use when).

Finally, it is possible that all these models estimate small posterior probabilities. Maybe that is simply optimal. Therefore, if you want the best posteriors, be sure to compare the models using some performance measure. As said before, you can use the log likelihood to evaluate the quality of your posteriors.

But perhaps you have specific costs for false positives and false negatives? Then try to use the performance measure you are actually interested in to evaluate your model. If you are going to compare your models, be sure to use proper cross validation, otherwise you will not be able to tell which model is better ;).
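As a made-up example of such a cost-sensitive measure, suppose a false negative is five times as costly as a false positive; given posteriors pTest for the positive class on a test set (names and costs here are hypothetical), you could compare models like this:

```matlab
cFP = 1;  cFN = 5;                          % hypothetical misclassification costs
threshold = cFP / (cFP + cFN);              % Bayes-optimal threshold for these costs
predPos = pTest >= threshold;               % decide positive above the threshold
yPos    = (ytest == 1);
totalCost = cFP * sum(predPos & ~yPos) + cFN * sum(~predPos & yPos);
```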
