Solved – LogisticRegression – binary classification, “custom threshold”

Tags: classification, logistic, scikit-learn

I have a binary classification problem that I am trying to solve with sklearn's Logistic Regression. I am aware that the predict_proba() function is apparently only an approximation of the "real" probability and somewhat fuzzy. However, after reading some threads, e.g. here, I was wondering whether I would violate any assumptions of the LR classification by customizing the decision threshold.

In the end, my classification problem allows me to make mistakes in one class but preferably not in the other, i.e. I want to maximize recall for the "important" class.
It seems like a very intuitive solution to simply shift the decision boundary in favor of one class, or even to have asymmetric decision boundaries. Put differently: only treat a prediction as "correct" if its probability is above 0.75; otherwise, make no prediction at all. Also, is this something with a known keyword in the ML world?
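Here is a minimal sketch of what I have in mind, on synthetic stand-in data (the 0.75/0.25 cut-offs are just illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for my data
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # estimated P(y = 1)

# Asymmetric rule with an abstain region: predict 1 only if P(y=1) > 0.75,
# predict 0 only if P(y=1) < 0.25, otherwise make no prediction (-1).
pred = np.full(len(proba), -1)
pred[proba > 0.75] = 1
pred[proba < 0.25] = 0
```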

Edit:

I should add that the naive solution of classifying everything into one class is not applicable (:

Best Answer

As written in numerous places on this site, logistic regression is a probability estimator. Any decision you want to make should take the predicted probability from the fitted logistic regression model, apply a utility function you specify, and choose the decision with the highest expected utility. The utility should not be incorporated by playing with the model's thresholds (at least not usually). Having probabilities also gives you the luxury of making no decision at all instead of forcing a class call.
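As a minimal sketch of that idea (the utility numbers below are made up for illustration; in practice you would derive them from the actual costs of each kind of error):

```python
import numpy as np

# Hypothetical utility matrix, made up for illustration:
# rows = true class (0, 1); columns = decision (predict 0, predict 1, abstain).
utility = np.array([
    [ 0.0, -1.0, -0.5],  # true class 0: a false positive costs 1
    [-4.0,  0.0, -0.5],  # true class 1: a false negative costs 4 (the "important" class)
])

def decide(p1, utility):
    """Choose the decision with the highest expected utility, given P(y = 1) = p1."""
    expected = (1 - p1) * utility[0] + p1 * utility[1]
    return int(np.argmax(expected))  # 0 = predict 0, 1 = predict 1, 2 = abstain

# With these numbers, low probabilities yield class 0, high ones class 1,
# and the gray zone in between yields no prediction at all.
for p1 in (0.1, 0.3, 0.6, 0.9):
    print(p1, decide(p1, utility))
```

Note that any thresholds implicitly fall out of this rule as a consequence of the utilities: change the utilities and the cut-offs move with them, which is why the model itself is the wrong place to hard-code them.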