Solved – Probability of class in binary classification

classification, machine learning, probability

I have a binary classification task with classes 0 and 1, and
the classes are unbalanced (class 1: ~8%).
The data set has ~10k samples; the number of features may vary but is around 50-100.

I am only interested in the probability that an input belongs to class 1, and
I will use the predicted probability as an actual probability in another
context later (see below).

I am wondering how best to model this problem.
My current approach is to use a random forest with predict_proba in scikit-learn, with ROC-AUC as the scoring function. The accuracy is 0.92 because the model
never predicts class 1 with a probability > 0.5.

After reading up on the subject I came across many suggestions and terms, and I
am trying to put a little structure into all of this. Specifically:

  1. I saw a couple of other scorers suggested, e.g. Cohen's kappa,
    Matthews correlation coefficient, PR-AUC and some more.
    Should I look at all of those or is there a favorite for my problem?

  2. I just came across the topic of probability calibration in scikit-learn.
    As I am interested in actual probabilities, I think it is quite relevant.
    Am I right to assume that an additional CalibratedClassifierCV should be
    included in my model, since it is based on decision trees?
    (Is that done automatically in R?)

  3. After looking at some Kaggle competitions, xgboost seems very promising.
    Is that algorithm well suited to my problem, or do you have other suggestions
    regarding the algorithm (stick to the RF)?
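
To make points 1 and 2 concrete, here is a minimal sketch of how a raw random forest and one wrapped in CalibratedClassifierCV could be compared. The data is a synthetic stand-in generated with make_classification (roughly 10k samples, 50 features, ~8% class 1), not the real data set, and the hyperparameters (100 trees, isotonic calibration, 5 folds) are arbitrary assumptions. Besides ROC-AUC and PR-AUC (average precision), it reports the Brier score and log loss, which measure the quality of the probabilities themselves rather than only their ranking.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    log_loss,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data described in the question (assumption):
# ~10k samples, 50 features, about 8% of samples in class 1.
X, y = make_classification(
    n_samples=10_000,
    n_features=50,
    n_informative=10,
    weights=[0.92, 0.08],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Plain random forest, probabilities taken straight from predict_proba.
raw_rf = RandomForestClassifier(n_estimators=100, random_state=0)
raw_rf.fit(X_train, y_train)

# Same forest wrapped in a calibrator; the calibration map is fitted
# on held-out folds of the training data (5-fold cross-validation).
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=5,
)
calibrated_rf.fit(X_train, y_train)


def score_probabilities(model):
    """Return ranking and probability-quality metrics on the test set."""
    p = model.predict_proba(X_test)[:, 1]  # P(class 1)
    return {
        "roc_auc": roc_auc_score(y_test, p),
        "pr_auc": average_precision_score(y_test, p),
        "brier": brier_score_loss(y_test, p),
        "log_loss": log_loss(y_test, p),
    }


results = {
    "raw RF": score_probabilities(raw_rf),
    "calibrated RF": score_probabilities(calibrated_rf),
}
for name, scores in results.items():
    print(name, {k: round(float(v), 4) for k, v in scores.items()})
```

Lower Brier score and log loss indicate better probability estimates. Whether calibration helps on the real data has to be checked in the same way, on a held-out set.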

Best Answer

You are considering different classifiers, but in fact this is not a classification problem. You are not interested in classifying your data as zeros and ones, but in predicting the probabilities that individual cases are zeros or ones. In this case the usual method of choice, which is designed especially for such problems, is logistic regression. Contrary to popular belief, logistic regression is not a classifier; rather, it predicts probabilities, so it does exactly what you want.
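
As a rough illustration, logistic regression in scikit-learn could be used like this. It is only a sketch under assumptions: the data is synthetic (make_classification, about 8% class 1), the features are standardized in a pipeline, and scikit-learn's default L2 penalty is left in place. Note that class_weight="balanced" is deliberately not used, since reweighting the classes shifts the predicted probabilities away from the observed class frequencies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): ~8% class 1.
X, y = make_classification(
    n_samples=10_000,
    n_features=50,
    n_informative=10,
    weights=[0.92, 0.08],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# No class reweighting: the model is fitted on the natural class balance
# so that its outputs can be read as probabilities of class 1.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated P(class 1) for each sample.
p_class1 = model.predict_proba(X_test)[:, 1]

print("Brier score:", round(float(brier_score_loss(y_test, p_class1)), 4))
print("Log loss:", round(float(log_loss(y_test, p_class1)), 4))
print("Mean predicted P(class 1):", round(float(p_class1.mean()), 4))
print("Observed share of class 1:", round(float(y_test.mean()), 4))
```

If the mean predicted probability is close to the observed share of class 1, the model is at least calibrated on average; a reliability plot (for example via sklearn.calibration.calibration_curve) gives a finer check.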