Logistic Regression – Identifying and Preventing Overfitting in Models

Tags: logistic, overfitting, regression-strategies

Is it possible to overfit a logistic regression model?
I saw a video saying that if my area under the ROC curve is higher than 95%, then the model is very likely overfitted. But is it actually possible to overfit a logistic regression model?

Best Answer

Yes, you can overfit logistic regression models. But first, I'd like to address the point about the AUC (Area Under the Receiver Operating Characteristic Curve): There are no universal rules of thumb with the AUC, ever ever ever.

What the AUC *is*, is the probability that a randomly sampled positive (or case) will have a higher marker value than a randomly sampled negative (or control), because the AUC is mathematically equivalent to the Mann–Whitney U statistic.
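To make that pairwise-probability interpretation concrete, here is a minimal sketch (my own illustration with hypothetical marker values, not code from the original answer) showing that the fraction of case/control pairs in which the case has the higher marker value equals the ROC AUC computed by a standard routine:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
cases = rng.normal(loc=1.0, size=100)      # hypothetical marker values for positives (Y = 1)
controls = rng.normal(loc=0.0, size=200)   # hypothetical marker values for negatives (Y = 0)

# Pairwise concordance: P(marker of a random case > marker of a random control),
# counting ties as one half.
concordance = ((cases[:, None] > controls[None, :]).mean()
               + 0.5 * (cases[:, None] == controls[None, :]).mean())

# The ROC AUC computed the usual way gives the same number.
y_true = np.concatenate([np.ones(len(cases)), np.zeros(len(controls))])
y_score = np.concatenate([cases, controls])
print(concordance, roc_auc_score(y_true, y_score))  # the two values agree
```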

What the AUC is *not* is a standardized measure of predictive accuracy. Highly deterministic outcomes can have single-predictor AUCs of 95% or higher (as in controlled mechatronics, robotics, or optics), while some complex multivariable logistic risk prediction models, such as breast cancer risk prediction, have AUCs of 64% or lower, and those are respectably high levels of predictive accuracy for their settings.

A sensible AUC value, as with a power analysis, is prespecified by gathering knowledge of the background and aims of a study a priori. The doctor/engineer describes what they want, and you, the statistician, decide on a target AUC value for your predictive model. Then begins the investigation.

It is indeed possible to overfit a logistic regression model. Aside from linear dependence (a model matrix of deficient rank), you can also have perfect concordance, that is, the fitted values perfectly discriminate cases from controls. In that case, your parameter estimates have not converged; they simply drift toward the boundary of the parameter space (toward $\pm\infty$), where the likelihood approaches its maximum. Sometimes, however, the AUC is 1 by random chance alone.
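For concreteness, here is a small sketch of complete separation (my own illustration on hypothetical data, not code from the original answer). It uses scikit-learn's ridge-penalized logistic regression and relaxes the penalty to stand in for the unpenalized maximum likelihood fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: x > 0 perfectly separates cases from controls.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

for C in (1.0, 100.0, 1e4, 1e6):
    # Larger C means a weaker L2 penalty, i.e. closer to the unpenalized MLE.
    model = LogisticRegression(C=C, max_iter=5000).fit(x, y)
    print(f"C={C:>8g}  coef={model.coef_[0, 0]:10.2f}")

# The coefficient keeps growing as the penalty is relaxed: with separated data
# the estimate never settles, it chases the boundary where the in-sample fit is perfect.
```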

There's another type of bias that arises from adding too many predictors to the model, and that's small-sample bias. In general, the log odds ratios of a logistic regression model tend toward a biased value of roughly $2\beta$ because of non-collapsibility of the odds ratio and zero cell counts. In inference, this is handled by using conditional logistic regression to control for confounding and precision variables in stratified analyses. However, in prediction, you're out of luck. There is no generalizable prediction when you have $p \gg n \pi(1-\pi)$ (where $\pi = \Pr(Y=1)$), because at that point you are guaranteed to have modeled the "data" rather than the "trend". High-dimensional (large $p$) prediction of binary outcomes is better done with machine learning methods. Understanding linear discriminant analysis, partial least squares, nearest-neighbor prediction, boosting, and random forests would be a very good place to start.
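The $p \gg n\pi(1-\pi)$ point can be seen in a short simulation (again a hypothetical sketch of my own, not the answer's code): with many more noise predictors than observations, a weakly penalized logistic fit discriminates the training data perfectly yet predicts no better than chance out of sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_train, n_test, p = 50, 1000, 200          # p far exceeds n * pi * (1 - pi)

# Pure noise: none of the predictors is related to the outcome.
X_train = rng.normal(size=(n_train, p))
y_train = rng.integers(0, 2, size=n_train)
X_test = rng.normal(size=(n_test, p))
y_test = rng.integers(0, 2, size=n_test)

# Weak L2 penalty (large C) as a stand-in for an essentially unregularized fit.
model = LogisticRegression(C=100.0, max_iter=5000).fit(X_train, y_train)

print("train AUC:", roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))  # close to 1.0
print("test AUC: ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))    # close to 0.5
```

The in-sample AUC reflects the "data" being memorized; the out-of-sample AUC shows there was never any "trend" to learn.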