Solved – How does logistic regression “elegantly” handle unbalanced classes

classification · logistic · unbalanced-classes

Frank Harrell, in his interesting blog post "Classification vs. Prediction", points out that using stratified sampling to handle unbalanced classes is a bad idea, since a classifier trained on an artificially biased data set will then do poorly on real-world data, which will be distributed differently from the training data.

He then states that:

Logistic regression on the other hand elegantly handles this situation by either (1) having as predictors the variables that made the prevalence so low, or (2) recalibrating the intercept (only) for another dataset with much higher prevalence.

I'm having a hard time digesting this, specifically the idea that logistic regression handles this elegantly:

  • What does he mean in (1)? If a disease is really rare, how would we include that as a feature? Or if malicious attacks on a network are very rare compared to legitimate logins, how would that be included as a feature?

  • In (2): Doesn't recalibrating the intercept in a logistic regression simply amount to playing with the classification threshold, which can be achieved with all sorts of binary classification methods (and is achieved implicitly by biasing the training data set)?

  • Moreover, isn't the bias introduced to the classifier a desirable outcome, given that our purpose is to detect the rare cases (in terms of the precision/recall tradeoff)?

Best Answer

  • No, we can't include the prevalence as a feature. After all, this is exactly what we are trying to model!

    What FH means here is that if there are features that contribute to the prevalence of the target, these will have appropriate parameter estimates in the logistic regression. If a disease is extremely rare, the intercept will be very small (i.e., negative with a large absolute value). If a certain predictor increases the prevalence, then this predictor's parameter estimate will be positive. (Predictors could include, e.g., a gene SNP, or the result of a blood test.)

    The end result is that logistic regression, if the model is correctly specified, will give you the correct probability for a new sample to be of the target class, even if the target class is overall very rare. This is as it should be. The statistical part of the exercise ends with a probabilistic prediction. What decision should be taken based on this probabilistic prediction is a different matter, which needs to take costs of decisions into account.
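To make this concrete, here is a minimal simulation sketch (the intercept of −5, slope of 1.5, and the single risk-factor predictor are all hypothetical). With a correctly specified model, the fitted intercept comes out strongly negative, encoding the rarity, while the predictor's coefficient recovers its true effect:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                    # hypothetical risk factor
logit = -5.0 + 1.5 * x                    # rare outcome: large negative intercept
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(C=1e9)         # huge C => effectively unpenalized
model.fit(x.reshape(-1, 1), y)

print("prevalence:", y.mean())            # low, on the order of 2%
print("intercept:", model.intercept_[0])  # close to the true -5
print("slope:", model.coef_[0, 0])        # close to the true 1.5
```

No resampling is needed: the rarity of the outcome lives in the intercept, and the predicted probabilities remain calibrated for new samples drawn from the same population.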

  • No, there is no threshold involved in logistic regression. (Nor in any other probabilistic model.) Per above, a threshold (or multiple ones!) may be used later, in weighing the probabilistic prediction against costs.

    Note the context in which FH discusses re-estimating the intercept: it is one of oversampling to address rare outcomes. Oversampling can be used in logistic regression. One would first fit a model to a sample that oversamples the rare outcome we are interested in. This gives us useful parameter estimates for the predictors we have in the model, but the intercept coefficient will be biased high. Then, in a second step, we can hold the predictor parameter estimates fixed and re-estimate the intercept coefficient only by refitting the model to the full sample.
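The two-step procedure can be sketched as follows (the simulated population, with true intercept −5 and slope 1.5, is hypothetical; the intercept-only refit is done with a one-dimensional Newton's method on the Bernoulli log-likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-5.0 + 1.5 * x))))

# Step 1: fit on an oversampled set (all positives + an equal number of negatives).
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])
over = LogisticRegression(C=1e9).fit(x[idx].reshape(-1, 1), y[idx])
slope = over.coef_[0, 0]       # roughly unbiased under case-control sampling
b0_over = over.intercept_[0]   # biased high by the oversampling

# Step 2: hold the slope fixed, re-estimate the intercept on the full sample.
eta = slope * x                # fixed linear predictor without intercept
b0 = b0_over
for _ in range(25):            # Newton steps on the concave log-likelihood
    p = 1 / (1 + np.exp(-(b0 + eta)))
    b0 -= np.sum(y - p) / (-np.sum(p * (1 - p)))

print("oversampled intercept:", b0_over)  # far above -5
print("recalibrated intercept:", b0)      # close to the true -5
```

Only the intercept changes in step 2; the predictor effects estimated on the oversampled data are kept, and the refit restores calibrated probabilities for the true prevalence.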

  • FH and I would argue that no, we should not aim for a precision/recall tradeoff. Instead, we should aim for well-calibrated probabilistic predictions, which can then be used in a decision along with (and I am repeating myself) the consequences of misclassification and other misdecisions. And as a matter of fact, this is exactly what logistic regression does. It does not care at all about precision or recall; it cares about the likelihood, which is just another way of looking at a probabilistic model. And no, bias is not a desirable trait in this context.
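To illustrate how a calibrated probability feeds into a decision, here is a minimal sketch with hypothetical costs (a missed case costing 100, an unnecessary intervention costing 1). The decision rule falls out of comparing expected costs, not out of a canned 0.5 threshold:

```python
# Hypothetical costs: false negative = 100, false positive = 1.
c_fn, c_fp = 100.0, 1.0

def decide(p):
    # Intervene when the expected cost of inaction exceeds that of action:
    # p * c_fn > (1 - p) * c_fp  <=>  p > c_fp / (c_fp + c_fn)
    return p > c_fp / (c_fp + c_fn)

print(decide(0.02))   # True: a 2% risk already justifies a cheap intervention
print(decide(0.005))  # False: below the ~1% cost-derived threshold
```

The statistical model supplies `p`; the threshold is a property of the costs, and different decision makers with different costs can use the same calibrated model.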
