Propensity Scores – Using Propensity Score Matching to Reduce ‘Class Imbalance’ Biases

logistic, matching, propensity-scores, regression, unbalanced-classes

Suppose I have a dataset of patients (with information such as height, smoking, weight, age, and disease status) in which 100 patients have the disease and 10000 patients do not (i.e. class imbalance).

I am interested in using Logistic Regression to understand which patient characteristics appear to influence the odds of having the disease. However, there are far more patients without the disease than with it.

I fear that fitting a Logistic Regression on the entire dataset might partly invalidate the results, as patients without the disease will have more influence on the model estimates. To potentially mitigate this problem, I am thinking of using Propensity Score Matching to select 100 patients who do not have the disease, chosen so that each one has an "approximate analog" in the disease set. As a result, I will have a dataset with only 200 patients, and the ratio of disease to non-disease will be balanced.

I have the following question: by using this Propensity Score Matching approach, I will end up discarding most of the non-diseased patients, and as a result I might forfeit large amounts of valuable information that could benefit the model. However, by including this information, I fear that I risk "flooding" the model with too much information corresponding to the non-diseased patients and suppressing information belonging to the diseased patients.

In general – can Propensity Score Matching be used to mitigate problems/biases associated with class imbalance when fitting regression models to such types of problems?

Best Answer

Putting aside whether reducing class imbalance is even a good thing, propensity score matching or any other matching method would be a terrible way to reduce class imbalance. I presume your strategy would be to find 100 non-diseased cases that are similar on all your covariates (i.e., on the propensity score) to the diseased cases. What you will be left with is a group of diseased and a group of non-diseased patients who are as close to identical on the covariates as the data allow, meaning you can't predict which is which from the covariates. The goal of matching is to make it so that covariates don't predict the treatment variable (in this case, disease status), but the entire point of your analysis is to be able to predict disease status. So creating groups that are balanced on the covariates is a terrible idea, essentially ruining your study. Do not do this.
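A minimal simulated sketch of this effect, using only the Python standard library. Everything in it is an assumption for illustration: the single covariate `x`, the shift of 1.0 between groups, and the greedy 1:1 matching without replacement. With one covariate and a logistic model that is monotone in it, matching on `x` gives the same pairs as matching on the propensity score, so the score is not estimated explicitly here.

```python
import random
import statistics


def simulate(n_cases=100, n_controls=10000, shift=1.0, seed=0):
    """Hypothetical data: one covariate, shifted upward among diseased patients."""
    rng = random.Random(seed)
    cases = [rng.gauss(shift, 1.0) for _ in range(n_cases)]
    controls = [rng.gauss(0.0, 1.0) for _ in range(n_controls)]
    return cases, controls


def match_nearest(cases, controls):
    """Greedy 1:1 nearest-neighbour matching without replacement."""
    available = list(controls)
    matched = []
    for x in cases:
        # Index of the unused control closest to this case on the covariate.
        j = min(range(len(available)), key=lambda i: abs(available[i] - x))
        matched.append(available.pop(j))
    return matched


cases, controls = simulate()
matched = match_nearest(cases, controls)

# Difference in covariate means: the signal a regression would pick up.
gap_full = statistics.mean(cases) - statistics.mean(controls)
gap_matched = statistics.mean(cases) - statistics.mean(matched)

print(f"mean difference, all controls:     {gap_full:.3f}")
print(f"mean difference, matched controls: {gap_matched:.3f}")
```

In the full data the diseased group differs clearly from the non-diseased group on `x`. After matching, the two groups of 100 have almost the same distribution of `x`, so a logistic regression fitted to the 200 matched patients would have almost nothing left to estimate.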

You may feel like propensity score matching sounds like a good method because its goal is to reduce imbalance in the covariates, but covariate imbalance is a completely different concept from class imbalance. Covariate imbalance concerns the association between treatment and the covariates (i.e., the very thing you are trying to study), and class imbalance concerns the sample size of the classes to be predicted. While it's true that matching would eliminate class imbalance by discarding a huge number of your non-diseased cases, it would also make it so that disease status cannot be predicted from the covariates. I reiterate, do not do this.
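To make the distinction concrete, here is a small sketch that computes one common summary of each kind of imbalance. The function names and the tiny age lists are hypothetical, and the standardized mean difference is only one of several ways to quantify covariate imbalance.

```python
import statistics


def class_ratio(diseased, non_diseased):
    """Class imbalance: depends only on how many patients are in each class."""
    return len(diseased) / len(non_diseased)


def standardized_mean_difference(diseased, non_diseased):
    """Covariate imbalance: how far apart the groups are on one covariate."""
    pooled_sd = ((statistics.variance(diseased)
                  + statistics.variance(non_diseased)) / 2) ** 0.5
    return (statistics.mean(diseased) - statistics.mean(non_diseased)) / pooled_sd


# Hypothetical ages, for illustration only.
age_diseased = [60, 65, 70]
age_non_diseased = [40, 45, 50, 55, 60, 65]

ratio = class_ratio(age_diseased, age_non_diseased)
smd = standardized_mean_difference(age_diseased, age_non_diseased)

print(f"class ratio (diseased / non-diseased): {ratio:.2f}")
print(f"standardized mean difference in age:   {smd:.2f}")
```

The first number says nothing about the covariates and the second says nothing about the group sizes. Matching drives the second toward zero, which is exactly the association the regression is meant to estimate.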
