Consequences of Unbalanced Subgroups of Categorical Variables in Logistic Regression

logistic-regression, stepwise-regression, unbalanced-classes

I have a dataset of around 120,000 (120K) unique individuals and am fitting a binary logistic regression with around 150 candidate variables. Some of the categorical variables are very imbalanced. For example, one variable takes the values Yes (=1) or No (=0): about 119K individuals are Yes, and the remaining 1K are No.

  1. What are the consequences of having such a variable in the logistic
    regression?

  2. Can it make my results unstable if I have many of these (say 10-20
    similar variables), when using stepwise (forward) or backward
    automatic selection procedures?

  3. What if the predictor / variable is not binary, but has for example
    5 categories where 1 category is largely over or underrepresented?

Best Answer

Stepwise variable selection procedures without using penalized maximum likelihood estimation are invalid. Pre-specify your model or use data reduction (unsupervised learning) first. Much has been written about this. Stepwise variable selection is highly unstable even with nothing but beautiful continuous predictors, and it badly distorts statistical properties of the result.
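A quick simulation illustrates that instability. This is a minimal sketch, not the answer's own analysis: it uses a simple correlation-based forward screen as a stand-in for stepwise logistic selection, and the 30-noise-predictor setup and all names are hypothetical. Even with nothing but pure-noise continuous predictors, the "selected" variable set changes from one bootstrap resample to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise setup: 30 continuous predictors, none truly related to y.
n, p, k = 200, 30, 5
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)  # random binary outcome


def forward_screen(X, y, k):
    """Greedy forward selection by absolute correlation with the residual —
    a simplified stand-in for stepwise logistic selection."""
    chosen = []
    resid = y - y.mean()
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        cors = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in remaining]
        best = remaining[int(np.argmax(cors))]
        chosen.append(best)
        remaining.remove(best)
        # crudely partial out the chosen column before the next step
        b = np.polyfit(X[:, best], resid, 1)
        resid = resid - np.polyval(b, X[:, best])
    return set(chosen)


# Repeat the procedure on bootstrap resamples of the same data:
# the selected sets disagree, even though nothing real is being found.
sets = []
for _ in range(20):
    idx = rng.integers(0, n, size=n)
    sets.append(forward_screen(X[idx], y[idx], k))

overlap = [len(sets[0] & s) / k for s in sets[1:]]
print("mean overlap with first resample:", round(float(np.mean(overlap)), 2))
```

If the selection were stable, every resample would return the same five variables; instead the overlap is partial and the chosen sets shift with each resample.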

Imbalance itself has no bad consequences; it is simply a fact of life that predictions for infrequent categories will have low precision, i.e., wide confidence intervals.
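The widening of the interval can be made concrete with Woolf's formula for the standard error of a log odds ratio from a 2x2 table, sqrt(1/a + 1/b + 1/c + 1/d): the small cells dominate. A sketch, assuming a hypothetical ~10% event rate in both groups of the 119K/1K example:

```python
import math


def log_or_se(a, b, c, d):
    """Standard error of the log odds ratio from a 2x2 table
    (Woolf's formula): sqrt(1/a + 1/b + 1/c + 1/d)."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


# Assumed ~10% events in each group (illustrative numbers only).
# Yes group: 119,000 subjects; No group: 1,000 subjects.
se_imbalanced = log_or_se(11900, 107100, 100, 900)
# A balanced 60K/60K split of the same 120,000 subjects:
se_balanced = log_or_se(6000, 54000, 6000, 54000)

print(f"imbalanced SE: {se_imbalanced:.3f}")  # dominated by 1/100 and 1/900
print(f"balanced   SE: {se_balanced:.3f}")
```

The 1/100 and 1/900 terms from the small group swamp everything else, so the coefficient's confidence interval is several times wider than under a balanced split — but the estimate is still perfectly valid.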

When thinking about imbalance, don't reason in terms of relative frequencies; use absolute frequencies. If you have 1,000 subjects in a category, no matter how small a proportion of the whole that is, you have an excellent information base for making estimates about that category.
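The arithmetic behind this: the standard error of a within-category proportion is sqrt(p(1-p)/n), which depends only on n, the absolute count in the category, not on the category's share of the dataset. A sketch with an assumed 10% event rate:

```python
import math


def prop_se(p, n):
    """Approximate standard error of a proportion estimated from n subjects."""
    return math.sqrt(p * (1 - p) / n)


# 1,000 subjects in the rare category, assumed 10% event rate.
se = prop_se(0.10, 1000)
half_width = 1.96 * se  # half-width of an approximate 95% CI

# Those 1,000 subjects give this same precision whether they are
# 1K out of 120K or 1K out of 2K — only the absolute count matters.
print(f"95% CI half-width: +/-{half_width:.3f}")
```

With n = 1,000 the 95% interval for the category's event rate is only about plus or minus two percentage points, which is the "excellent information base" the answer describes.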