Imbalanced Data – How to Perform Logistic Regression with Imbalanced Data?

classification, logistic, machine-learning, regression, unbalanced-classes

In class we are learning about the SMOTE (Synthetic Minority Oversampling Technique) algorithm. As I understand, this algorithm can be used to increase the effectiveness of Machine Learning models when the datasets are imbalanced.

An example our professor gave was this: suppose you have a dataset with information about medical patients (e.g. age, gender, height, etc.), and the response variable is whether or not the patient has a specific disease – thus, the goal would be to predict whether a new patient has the disease (i.e. supervised binary classification). Now, imagine that this is a rare disease and only 5% of the patients in your dataset have it. If you fit a Machine Learning model to this data (e.g. Random Forest), the model can "get away" with saying that no patients have the disease – and still produce very good accuracy (although the F-score will be poor)! This is because the model will not have observed enough disease cases to effectively "learn" the difference between diseased and non-diseased cases – it will therefore generalize poorly and state that all future patients do not have the disease.
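To make this concrete, here is a minimal sketch of the "accuracy trap" just described. The synthetic labels and the 95/5 split are my own illustrative assumptions, not the professor's actual data:

```python
# A "model" that always predicts the majority class still scores ~95% accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=100_000, p=[0.95, 0.05])  # 5% have the disease

y_pred = np.zeros_like(y_true)  # always predict "no disease"

print(accuracy_score(y_true, y_pred))  # ~0.95 -- looks impressive
print(f1_score(y_true, y_pred))        # 0.0  -- the model has learned nothing
```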

In practice, the SMOTE algorithm is allegedly able to partly rectify this problem. Suppose there is a dataset of 100,000 patients and 5% of them (5,000 patients) have the disease. If you randomly select 10,000 patients (10% of the original dataset) who do not have the disease and combine them with the 5,000 patients who do, the resulting 15,000-row dataset might create enough of a "contrast" that the Machine Learning model has a better chance of distinguishing between Disease and Non-Disease cases. What is described here is a general Oversampling/Undersampling technique, sketched below – the SMOTE algorithm is a more sophisticated version of the same premise.
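A rough sketch of that undersampling step might look like the following (the DataFrame here is a synthetic stand-in, and the column names are illustrative choices of mine):

```python
# Keep all diseased patients; randomly keep 10,000 of the healthy ones.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=100_000),
    "disease": rng.choice([0, 1], size=100_000, p=[0.95, 0.05]),
})

diseased = df[df["disease"] == 1]                                  # ~5,000 rows
healthy = df[df["disease"] == 0].sample(n=10_000, random_state=0)  # 10,000 rows
balanced = pd.concat([diseased, healthy]).sample(frac=1, random_state=0)

print(balanced["disease"].mean())  # about 1/3 positive, instead of 5%
```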

I have the following question: when reading online, the SMOTE algorithm seems to be used far more often in Machine Learning contexts than with "traditional" statistical models. For example, I see SMOTE applied alongside models such as XGBoost and Random Forests much more often than alongside regression models. Suppose, in the above example with the medical patients and the imbalanced data, I want to use a Logistic Regression model for interpretability reasons (e.g. Odds Ratios, statistical significance of coefficients, etc.): in theory, is there anything which suggests that Oversampling/Undersampling approaches are inherently ill-suited to a Logistic Regression model? Or should such approaches only be used alongside Machine Learning models?

Best Answer

Class imbalance isn’t much of a problem, and many of the apparent issues come from using a surprisingly poor “accuracy” metric. Depending on how much you care about this class and topic, you might want to press your instructor on why class imbalance is an issue at all.

Moreover, SMOTE does not even seem to be very good at what it aims to do.

So should you use SMOTE when you run a logistic regression? Probably not, but that’s really because you probably shouldn’t use it for any kind of machine learning. However, I would not consider logistic regression special when it comes to using SMOTE, and I would expect the full-credit answer on your exam to be that SMOTE is totally compatible with logistic regression (even if I consider this only the full-credit answer, not the correct answer). Nothing mechanical stops you, as the sketch below shows.
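For what it’s worth, here is a sketch of SMOTE feeding a logistic regression, assuming the imbalanced-learn package and synthetic data. This illustrates compatibility, not an endorsement:

```python
# SMOTE resampling chained into a logistic regression via an imblearn pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),            # resamples the training data
    ("logit", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)  # runs without complaint: SMOTE does not preclude logistic regression
```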

You also mention undersampling. Since class imbalance isn’t really a problem, there is no need to discard precious data to fix a non-problem.

(Yes, Dikran Marsupial gives an interesting example of when class imbalance really is a problem. That’s quite different from what most in machine learning mean when they talk about class imbalance being a problem. I might even argue that his example is a matter of experimental design rather than of model evaluation, the latter being where practitioners like your instructor seem to be under the impression that class imbalance poses problems.)

EDIT

The bounty message mentions wanting a reputable source. Frank Harrell’s Regression Modeling Strategies textbook gets into this and has many references to the primary literature. Further, he has (at least) two blog posts on the topic: (1) (2).

EDIT 2

I contest the claim, in the example your professor gave, that a naïve classifier with an accuracy of $95\%$ is achieving a good accuracy score. Where I think people get into trouble here is in interpreting accuracy as being akin to $R^2$ in regression, since both are proportions (and that isn’t even totally true for $R^2$). Their logic seems to be that, since $R^2 = 0.95$ probably would be considered strong performance, an accuracy of $95\% = 0.95$ should be considered impressive, too. The trouble is that all $R^2$ and accuracy have in common is that they are (sort of) proportions.

A reasonable view of $R^2$ is that it compares the square loss of the model to the square loss of a “must-beat” naïve model. A reasonable counterpart for classification accuracy would compare the model’s error rate to the error rate of a model that naïvely guesses the majority class every time. Subtracting the ratio of these two error rates from one, $1 - \frac{\text{err}_{\text{model}}}{\text{err}_{\text{naive}}}$, would then be analogous to $R^2$, and the performance of the model given by your professor would be zero, correctly highlighting the fact that the model does essentially nothing, despite an accuracy score of $95\%$ that (misleadingly) looks like a high $\text{A}$ in school that makes us happy.

Why such a metric is not more widespread is a mystery to me. Indeed, even a good UCLA page gives this metric (granted, in a different but equivalent form) as something like $R^2$ for a classification problem, yet I rarely see it discussed.
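As a quick sanity check on the arithmetic, using the numbers from your professor’s 95/5 example (the variable names are just mine):

```python
# One minus the ratio of the model's error rate to a naive baseline's error rate.
model_error = 0.05  # the always-"no disease" model misses every diseased patient
naive_error = 0.05  # always guessing the majority class also errs 5% of the time

skill = 1 - model_error / naive_error
print(skill)  # 0.0 -- zero improvement over the naive baseline, despite 95% accuracy
```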