Solved – Logistic Regression Cutoff Values for Multiple Models

accuracy, logistic, regression-strategies

I understand that once a logistic regression model has output probabilities, a cutoff value for classifying new observations is chosen so that the model optimizes some metric like sensitivity or specificity. A confusion matrix is then constructed at that cutoff, and the metric of interest is calculated from it. I know the choice of cutoff is somewhat trial and error: you adjust it until the confusion matrix gives you values you are comfortable with.
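To make the mechanics concrete, here is a minimal sketch of applying a cutoff to predicted probabilities and tallying the resulting confusion matrix. The probabilities and labels are made-up illustrative values, not output from any real model:

```python
# Hypothetical predicted probabilities from a fitted logistic model,
# paired with the true 0/1 labels for the same observations.
probs = [0.91, 0.40, 0.72, 0.18, 0.65, 0.07, 0.85, 0.33]
labels = [1, 0, 1, 0, 1, 0, 1, 1]

def confusion(probs, labels, cutoff):
    """Classify each probability against the cutoff and tally the 2x2 table."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion(probs, labels, cutoff=0.5)
sensitivity = tp / (tp + fn)  # true positive rate at this cutoff
specificity = tn / (tn + fp)  # true negative rate at this cutoff
```

Sliding the cutoff trades sensitivity against specificity, which is exactly why the choice feels like trial and error.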

My question: what if you are comparing multiple logistic regression models for performance? Do you use the same cutoff value for the predicted probabilities of each model, or can you use a different cutoff for each? My concern is that if you tailor the cutoff to each model, you could make them all perform very similarly, which would make the cutoff the key variable driving performance rather than something like AIC in the model-building stage.

EDIT: To clarify my question: should model performance be compared on the training set using AIC and other criteria, with only the winning model then applied to the test set to confirm its performance? Or should all of the best models found in training be applied to the test set and compared there?
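For reference, AIC-based comparison on the training set works from each model's maximized log-likelihood and parameter count, with no cutoff involved. A minimal sketch with invented numbers (the log-likelihoods and parameter counts below are purely illustrative):

```python
def aic(log_likelihood, n_params):
    # AIC = 2k - 2*ln(L), where k is the number of fitted parameters
    # and ln(L) is the maximized log-likelihood. Lower AIC is better.
    return 2 * n_params - 2 * log_likelihood

# Two hypothetical candidate models fit on the same training set:
aic_small = aic(log_likelihood=-120.4, n_params=4)  # fewer predictors
aic_large = aic(log_likelihood=-118.9, n_params=9)  # more predictors
# The smaller model wins here despite a slightly worse fit, because the
# penalty on the extra parameters outweighs the gain in log-likelihood.
```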

Best Answer

You have misunderstood logistic regression. Logistic regression is a probability estimator. Cutoffs and improper accuracy scores should play no role in logistic regression analysis and will result in arbitrariness and loss of power/precision.

When you say 'multiple logistic models' you are implying that you don't know how to specify 'the' model or that you are doing problematic variable selection. Please elaborate on why you need to compare multiple models. Note that such comparisons should be based on gold standard methods such as likelihood-based measures.
