Solved – Finding an optimal class probability threshold for SVM

calibration, cross-validation, svm, threshold

I've got an imbalanced data set on which I'm training an SVM using cross-validation. I'd like to find the optimal class probability threshold, i.e. the one that maximizes the F-measure.

I've tried doing this by using the class probabilities found during cross-validation. I first calibrated these probabilities by training a regression model on them and using it to estimate the true probabilities (Platt scaling). I then tried a range of thresholds and chose the one that maximized the F-measure.
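For concreteness, here is a minimal sketch of that procedure in Python with scikit-learn. The names `X_train` and `y_train` (0/1 labels), the RBF-kernel `SVC`, and the use of decision-function scores as the input to Platt scaling are my assumptions, not details from the question:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# Out-of-fold SVM decision values from 5-fold cross-validation
svm = SVC(kernel="rbf")
cv_scores = cross_val_predict(svm, X_train, y_train, cv=5,
                              method="decision_function")

# Platt scaling: fit a logistic regression on the raw SVM scores
platt = LogisticRegression()
platt.fit(cv_scores.reshape(-1, 1), y_train)
cv_probs = platt.predict_proba(cv_scores.reshape(-1, 1))[:, 1]

# Sweep a range of thresholds and keep the F1-maximizing one
thresholds = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(y_train, (cv_probs >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]
```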

To evaluate on a test sample, I used the tuned SVM to predict the class probabilities. I again calibrated these using the previously learned regression model and applied the threshold to assign classes.
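Continuing the sketch above, the test-time step would reuse the fitted Platt model and the chosen threshold (again assuming `X_test` exists):

```python
# Fit the SVM on the full training set, then score the test sample
svm.fit(X_train, y_train)
test_scores = svm.decision_function(X_test)

# Calibrate with the previously fitted Platt model, then threshold
test_probs = platt.predict_proba(test_scores.reshape(-1, 1))[:, 1]
test_pred = (test_probs >= best_threshold).astype(int)
```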

However, this results in much worse predictions than the default threshold, and I have no clue why. The only thing I suspect might be the cause is that I applied the regression model to the same data I trained it on, which could make the threshold overfit a bit; but I would expect that to make the threshold slightly suboptimal, not dramatically worse.

Any thoughts on how I should handle this?

Best Answer

Edit based on comments:

Yes, the idea of Platt scaling is to use the output of another model as an additional feature, hoping that the logistic regression will perform similarly to the original model but with smoother probabilities. I have tried this and the other methods from http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html, mainly for Kaggle competitions, and they don't seem to work that well on large, sparse datasets.
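For reference, a minimal sketch of that scikit-learn API; note that depending on the scikit-learn version the first parameter is named `base_estimator` or `estimator`, so it is passed positionally here:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Sigmoid (Platt) calibration fitted with internal cross-validation;
# method="isotonic" is the non-parametric alternative.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]
```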

I feel the reason has more to do with the features not being suitable for logistic regression, so the final model performs worse than the original XGBoost/SVM on the test dataset.

Something that has worked for me a few times is training the logistic regression on the predictions of multiple classifiers (meta-learning/stacking), as sketched below, but the effort has not really been worth the gain.
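As an illustration of that stacking setup, here is a sketch using scikit-learn's `StackingClassifier`, which by default fits the final logistic regression on out-of-fold predictions of the base classifiers. The choice of base classifiers is mine, not the answer's:

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Logistic regression meta-learner on top of two base classifiers
stack = StackingClassifier(
    estimators=[("svm", LinearSVC()), ("rf", RandomForestClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
stack_probs = stack.predict_proba(X_test)[:, 1]
```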

Correct link for R: http://danielnee.com/2014/10/calibrating-classifier-probabilties/