Logistic Regression – Consequences of Rare Events in Logistic Regression Models

assumptions, logistic, rare-events

I know that sample size affects power in any statistical method. There are rules of thumb for how many samples a regression needs for each predictor.

I also hear often that the number of samples in each category in the dependent variable of a logistic regression is important. Why is this?

What are the actual consequences to the logistic regression model when the number of samples in one of the categories is small (rare events)?

Are there rules of thumb that incorporate both the number of predictors and the number of samples in each level of the dependent variable?

Best Answer

The standard rule of thumb for linear (OLS) regression is that you need at least $10$ observations per variable, or you will be 'approaching' saturation. For logistic regression, however, the corresponding rule of thumb is that you want at least $15$ observations in the less commonly occurring category for every variable.
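To make the logistic rule concrete, here is a minimal sketch of the arithmetic. The function names and the example counts are made up for illustration; the threshold of $15$ is just the rule of thumb quoted above, not a hard limit.

```python
def events_per_variable(n_events: int, n_nonevents: int, n_predictors: int) -> float:
    """Observations in the rarer outcome category per predictor."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    # The rarer category is the limiting one, whichever label it carries.
    return min(n_events, n_nonevents) / n_predictors


def max_predictors(n_events: int, n_nonevents: int, per_variable: int = 15) -> int:
    """Largest number of predictors the rule of thumb would allow."""
    return min(n_events, n_nonevents) // per_variable


# Hypothetical data set: 1000 observations, of which 60 are events.
print(events_per_variable(60, 940, 10))  # 6.0, well short of 15
print(max_predictors(60, 940))           # 4
```

Note that the total sample size never enters the calculation directly: only the count in the smaller category does.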

The issue here is that binary data just don't contain as much information as continuous data. Moreover, you can get perfect predictions even with a lot of data if only a couple of the observations are actual events. To take an example that is rather extreme, but should be immediately clear, consider a case where you have $N = 300$ and therefore try to fit a model with $30$ predictors, but have only $3$ events. You simply can't estimate the association between most of your $X$-variables and $Y$.
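One way to see why the rarer category is the limiting factor is the simplest possible case, a model with an intercept only. There the usual large-sample standard error of the estimated log-odds is $\sqrt{1/n_{\text{events}} + 1/n_{\text{non-events}}}$, which is dominated by the smaller count. The sketch below just evaluates that formula; the function name is made up for illustration, and the approximation is rough when there are very few events, so treat the numbers as indicative only.

```python
import math


def log_odds_se(n_events: int, n_nonevents: int) -> float:
    """Large-sample standard error of the log-odds in an intercept-only model."""
    if n_events <= 0 or n_nonevents <= 0:
        raise ValueError("both counts must be positive")
    return math.sqrt(1.0 / n_events + 1.0 / n_nonevents)


# 3 events out of 300: the SE is about 0.58.
print(round(log_odds_se(3, 297), 3))
# Ten times the data but still 3 events: the SE hardly moves.
print(round(log_odds_se(3, 2997), 3))
# 300 observations split evenly: the SE is far smaller.
print(round(log_odds_se(150, 150), 3))
```

Adding thousands of non-events barely improves precision, whereas a balanced sample of the same size is several times more informative.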
