Importance of Balanced Samples in Logistic Regression

Tags: faq, logistic-regression, sample-size, unbalanced-classes

Okay, so I think I have a decent enough sample, taking into account the 20:1 rule of thumb: a fairly large sample (N = 374) for a total of 7 candidate predictor variables, i.e. roughly 53 observations per predictor.

My problem is the following: whatever set of predictor variables I use, the classifications never get better than a specificity of 100% and a sensitivity of 0%; in other words, the model always predicts the majority class. However unsatisfactory, this could actually be the best possible result, given the set of candidate predictor variables (from which I can't deviate).

But I couldn't help thinking I could do better, and I noticed that the categories of the dependent variable are quite unevenly balanced, almost 4:1. Could a more balanced subsample improve the classifications?

Best Answer

Balance in the Training Set

For logistic regression models, unbalanced training data affects only the estimate of the model intercept (although this of course skews all the predicted probabilities, which in turn compromises your predictions). Fortunately, the intercept correction is straightforward: provided you know, or can guess, the true proportion of 0s and 1s and know the proportions in the training set, you can apply a rare-events correction to the intercept. Details are in King and Zeng (2001) [PDF].
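Concretely, the 'prior correction' described by King and Zeng works as follows, where τ is the true population fraction of 1s and ȳ is the fraction of 1s in your training sample:

```latex
% Prior correction for the intercept (King & Zeng, 2001).
% \tau        : true proportion of 1s in the population (known or guessed)
% \bar{y}     : proportion of 1s in the training sample
% \hat\beta_0 : intercept estimated on the unbalanced/selected sample
\hat{\beta}_0^{\,\text{corrected}}
  = \hat{\beta}_0
  - \ln\!\left( \frac{1-\tau}{\tau} \cdot \frac{\bar{y}}{1-\bar{y}} \right)
```

For example, with a balanced training sample (ȳ = 0.5) drawn from a population where τ = 0.2, the correction subtracts ln 4 ≈ 1.39 from the intercept, pulling the predicted probabilities back down toward the true base rate.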

These 'rare-event corrections' were designed for case-control research designs, mostly used in epidemiology, which select cases by choosing a fixed, usually balanced, number of 0 cases and 1 cases and then correct for the resulting sample-selection bias. Indeed, you might train your classifier the same way, as in the sketch below: pick a nice balanced sample, then correct the intercept to account for the fact that you've selected on the dependent variable in order to learn more about the rarer class than a random sample would tell you.
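If it helps, here is a minimal sketch of that workflow in Python with scikit-learn. The data are simulated stand-ins with roughly the 4:1 imbalance from the question; in practice `tau` would come from your knowledge of the population, not from the training data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: X (n x p feature matrix), y (binary 0/1 outcomes),
# with roughly 4:1 imbalance as in the question.
rng = np.random.default_rng(0)
n = 374
X = rng.normal(size=(n, 7))
y = (rng.random(n) < 0.2).astype(int)  # ~20% positives

# Balanced subsample: all the 1s, plus an equal number of random 0s.
ones = np.flatnonzero(y == 1)
zeros = rng.choice(np.flatnonzero(y == 0), size=len(ones), replace=False)
idx = np.concatenate([ones, zeros])

# penalty=None needs scikit-learn >= 1.2; use penalty='none' on older versions.
model = LogisticRegression(penalty=None).fit(X[idx], y[idx])

# Prior correction (King & Zeng, 2001): adjust the intercept for the
# fact that we selected on the dependent variable.
tau = y.mean()        # true population proportion of 1s (stand-in here)
ybar = y[idx].mean()  # proportion of 1s in the balanced training set (0.5)
model.intercept_ -= np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Predicted probabilities now reflect the true base rate again.
p = model.predict_proba(X)[:, 1]
```

After the correction, `predict_proba` returns probabilities on the scale of the true base rate, so downstream thresholding behaves sensibly.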

Making Predictions

On a related but distinct topic: don't forget to threshold intelligently when making predictions. It is not always best to predict 1 when the model probability is greater than 0.5; another threshold may be better. To this end, look into the Receiver Operating Characteristic (ROC) curve of your classifier, not just its predictive success at a default probability threshold.
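As an illustration, here is one common (but not the only) way to pick a threshold from the ROC curve, continuing the hypothetical `y` and `p` from the sketch above. Youden's J statistic is just one choice; the right threshold ultimately depends on the relative costs of false positives and false negatives:

```python
import numpy as np
from sklearn.metrics import roc_curve

# y: observed 0/1 labels; p: predicted probabilities from the
# intercept-corrected model above (both hypothetical stand-ins).
fpr, tpr, thresholds = roc_curve(y, p)

# Youden's J picks the threshold maximising sensitivity + specificity - 1.
j = tpr - fpr
best = thresholds[np.argmax(j)]

y_pred = (p >= best).astype(int)
print(f"chosen threshold: {best:.3f} (rather than the default 0.5)")
```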