Solved – Unbalanced data with logistic regression: good references

importance, logistic, machine-learning, regression, unbalanced-classes

I am using the logistic regression framework to formulate a classification model. I have a dataset with 42 'true' (response variable) values and 4400 'false' ones. Using the 'rule-of-10' and other considerations, I have selected four independent variables. My aim is solely to understand the relative importance of each of these variables (if any) in determining the level of the dependent variable. In this case, I understand that even with an unbalanced dataset (42 versus 4400), logistic regression could still produce good coefficient estimates. Specifically, Wikipedia says: 'Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome.' This also seems intuitively correct if one thinks about how the sigmoid curve is fitted.
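The claim can be checked numerically. The standard result (from the case-control sampling literature) is that subsampling one class changes only the intercept, by the log of the sampling fraction, while the slope coefficients stay consistent. Below is a minimal sketch with simulated data and a plain Newton-Raphson fit; the sample sizes, true coefficients, and the `fit_logistic` helper are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, n_iter=50):
    """Fit logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        grad = X.T @ (y - p)                       # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

# Simulate rare events: true intercept -3, true slope 1
n = 60000
x = rng.normal(size=n)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(-3.0 + x)))).astype(float)
X = np.column_stack([np.ones(n), x])

beta_full = fit_logistic(X, y)

# Keep all positives but only ~10% of negatives (case-control style subsampling)
keep = (y == 1) | (rng.uniform(size=n) < 0.1)
beta_sub = fit_logistic(X[keep], y[keep])

# The slope estimates agree; only the intercept shifts, by about log(1/0.1)
print("full:", beta_full, "subsampled:", beta_sub)
```

The slope from the subsampled fit is close to the full-data slope, while the intercept moves up by roughly `log(10)`, which is exactly the pattern the Wikipedia quote is alluding to.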

Can you please help me with a reference (preferably, a textbook) for this statement? I checked some versions of the textbook by Hosmer and Lemeshow, but could not find anything.

Best Answer

See Does an unbalanced sample matter when doing logistic regression?, where most of this is answered. Imbalance in itself is not a problem, but only 42 'true' responses with 4 predictors is borderline. The Wikipedia quote seems somewhat imprecise, but it is not wrong. That post gives more information.

The 'rule-of-10' you refer to has been criticized; see Minimum number of observations for logistic regression? and also Sample size for logistic regression?, especially F. Harrell's answers, where references are given.

But if you want to try anyhow (who wouldn't?), note that the asymptotic approximations behind the usual Wald tests for logistic regression behave poorly with so few events, so the standard errors cannot be trusted. Search this site for the Hauck-Donner phenomenon: Logistic regression in R resulted in perfect separation (Hauck-Donner phenomenon). Now what?. A possible remedy is to calculate confidence intervals for the parameters via likelihood profiling; see Binomial GLM - non-significant difference between 100% opposite groups of observations for an example.
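To make the likelihood-profiling remedy concrete, here is a sketch of a profile-likelihood confidence interval for a slope in a rare-event logistic model, computed from first principles with scipy. The data, the `nll`/`profile_ll` helpers, and the sample sizes are illustrative assumptions; in R one would simply use `confint()` on a `glm` fit, which profiles by default.

```python
import numpy as np
from scipy.optimize import minimize, brentq
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Simulated rare-event data, loosely in the spirit of the question (~5% events)
n = 1000
x = rng.normal(size=n)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(-3.0 + x)))).astype(float)
X = np.column_stack([np.ones(n), x])

def nll(beta):
    """Negative log-likelihood, computed stably via log(1 + exp(eta))."""
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

fit = minimize(nll, np.zeros(2), method="BFGS")
beta_hat, ll_max = fit.x, -fit.fun

def profile_ll(b1):
    """Profile log-likelihood: maximise over the intercept with the slope fixed."""
    res = minimize(lambda b0: nll(np.array([b0[0], b1])), [beta_hat[0]], method="BFGS")
    return -res.fun

# 95% profile interval: slope values where the deviance drop hits chi2(1) / 2
cut = chi2.ppf(0.95, df=1) / 2.0
g = lambda b1: ll_max - profile_ll(b1) - cut
lo = brentq(g, beta_hat[1] - 5.0, beta_hat[1])
hi = brentq(g, beta_hat[1], beta_hat[1] + 5.0)
print(f"slope MLE {beta_hat[1]:.3f}, 95% profile CI ({lo:.3f}, {hi:.3f})")
```

Unlike a Wald interval, which is symmetric around the estimate and collapses when the standard error blows up (as in the Hauck-Donner situation), the profile interval follows the actual shape of the likelihood and remains usable when the events are few.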
