Solved – How to choose the cutoff probability for a rare event Logistic Regression

classification, generalized linear model, logistic regression, roc

I have 100,000 observations (9 dummy indicator variables) with 1,000 positives. Logistic regression should work fine in this case, but the choice of cutoff probability puzzles me.

In the common literature, a 50% cutoff is used to predict 1s and 0s. I cannot do this, as my model's maximum predicted probability is only about 1%, so the threshold would have to sit somewhere around 0.007.

I do understand ROC curves and how the area under the curve can help me choose between two LR models for the same dataset. However, the ROC curve doesn't help me choose an optimal cutoff probability that can then be used to test the model on out-of-sample data.

Should I simply use a cutoff value that minimizes the misclassification rate? (http://www2.sas.com/proceedings/sugi31/210-31.pdf)
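For concreteness, here is roughly what I mean (a quick sketch with simulated data of about the same shape as my problem, not my actual model):

    # Rough sketch: scan candidate cutoffs and pick the one minimizing the
    # misclassification rate. Simulated data of roughly my problem's shape.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100_000, 9))           # 9 dummy indicator variables
    logit = -5.5 + X @ rng.normal(0.3, 0.2, size=9)     # rare-event data-generating process
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))       # roughly 1% positives

    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

    cutoffs = np.linspace(0.0, p.max(), 500)            # candidate thresholds up to max predicted probability
    misclass = np.array([np.mean((p >= c) != y) for c in cutoffs])
    print(f"cutoff minimizing misclassification rate: {cutoffs[misclass.argmin()]:.4f}")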

Added –> For such a low event rate, my misclassification rate is dominated by a huge number of false positives. The overall rate looks good only because the total universe is so large, but my model should not produce this many false positives (it is an investment return model). 5 of the 10 coefficients are significant.

Best Answer

I disagree that a 50% cutoff is either inherently valid or supported by the literature. The only case where such a cut-off might be justified is a case-control design in which the prevalence of the outcome is exactly 50%, and even then the choice would be subject to a few conditions. I think the principal rationale for the choice of cut-off is the desired operating characteristic of the diagnostic test.

A cut-off may be chosen to achieve a desired sensitivity or specificity. For examples of this, consult the medical-device literature, where sensitivity is often set to a fixed value such as 80%, 90%, 95%, 99%, 99.9%, or 99.99%. The sensitivity/specificity trade-off should be weighed against the harms of Type I and Type II errors. Often, as with statistical testing, the harm of a Type I error is greater, and so we control that risk. Still, these harms are rarely quantifiable. Because of that, I have major objections to cut-off selection methods that rely on a single measure of predictive accuracy: they convey, incorrectly, that the harms can be and have been quantified.
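As a rough sketch of what that looks like in practice (toy data and a 90% sensitivity target purely for illustration, not a recommendation for your model):

    # Sketch: choose the cutoff attaining a target sensitivity, then report the
    # specificity achieved at that cutoff. Toy data stands in for real predictions.
    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    y = rng.binomial(1, 0.01, size=100_000)                     # ~1% event rate
    p = np.clip(rng.normal(0.006 + 0.004 * y, 0.002), 0, 1)     # toy predicted probabilities

    fpr, tpr, thresholds = roc_curve(y, p)       # thresholds come back in decreasing order

    target_sensitivity = 0.90                    # illustrative target, not a recommendation
    idx = np.argmax(tpr >= target_sensitivity)   # first (i.e. highest) cutoff reaching the target
    print(f"cutoff: {thresholds[idx]:.4f}, "
          f"sensitivity: {tpr[idx]:.3f}, specificity: {1 - fpr[idx]:.3f}")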

Your issue of too many false positives is exactly this situation: the false positive (Type I error) is the more harmful one for you. In that case you may set the threshold to achieve a desired specificity, and report the achieved sensitivity at that threshold.

If you find that both are too low to be acceptable in practice, your risk model does not work and should be rejected.

Sensitivity and specificity are easily calculated, or looked up from a table, over the entire range of possible cut-off values. The trouble with the ROC curve is that it omits the specific cut-off values from the graphic, which makes it unhelpful for choosing a cutoff.
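Continuing the sketch above, the roc_curve output already contains that table; laying it out makes the trade-off at each cut-off explicit (again, purely illustrative):

    # Sketch, continuing from fpr/tpr/thresholds above: the full table of cutoffs
    # with their sensitivity and specificity, thinned to ~20 rows for display.
    import pandas as pd

    table = pd.DataFrame({"cutoff": thresholds, "sensitivity": tpr, "specificity": 1 - fpr})
    print(table.iloc[::max(1, len(table) // 20)].round(4).to_string(index=False))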