Solved – Unbalanced distribution of sample size between groups in logistic regression: should one worry?

generalized linear model, logistic, regression, sample-size, unbalanced-classes

I need to fit logistic regression models to a dataset where infection (present/absent) is my dependent variable and neighborhood (a factor with three levels: Rich, Poor, Very Poor) is my independent variable.

According to a reviewer who, like me, is not well versed in statistics, one potential problem with my data is that the sample sizes are quite unevenly distributed across the levels of neighborhood:

Rich = 853  
Poor = 100  
Very Poor = 131

The reviewer suggested randomly subsampling the "Rich" group down to about 100 observations in order to meet this alleged assumption of approximately equal sample sizes across the levels of a variable.

Because of the hypothesis behind our study, I need to set "Rich" as the reference category against which to compare the remaining two.

Is the reviewer's suggestion well founded? AFAIK, logistic regression makes no assumption that the categories of an independent variable are balanced, so unbalanced or even sparse categories violate nothing, and the same holds when it is the dependent variable that is unbalanced.

Best Answer

You are right that logistic regression does not make any assumptions about the distribution of your independent variable. The consequence of your situation is that you have less power than you would with equal $n$s. However, reducing the $n$ in the Rich group will only lessen your power further. The point is rather that, with the same total $N$ divided equally among the groups, you would have more power; discarding data you already have does not help. Although written in a different context (viz., t-tests), you can get the general idea from my answer here: How should one interpret the comparison of means from different sample sizes?
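To make the power argument concrete, here is a minimal sketch comparing the large-sample standard error of the Poor vs. Rich log odds ratio under three designs: the full data, the Rich group subsampled to 100, and the same total split evenly. With neighborhood as the only predictor, this is the standard error of the corresponding logistic regression coefficient evaluated at expected cell counts. The infection prevalences (10% in Rich, 20% in Poor) and the function name are assumptions chosen for illustration; they do not come from the question's data.

```python
import math


def se_log_odds_ratio(n_ref, n_grp, p_ref, p_grp):
    """Large-sample SE of the log odds ratio of a group vs. the reference.

    Each group's estimated log odds has approximate variance
    1 / (n * p * (1 - p)); the two groups are independent, so the
    variances add.
    """
    var_ref = 1.0 / (n_ref * p_ref * (1.0 - p_ref))
    var_grp = 1.0 / (n_grp * p_grp * (1.0 - p_grp))
    return math.sqrt(var_ref + var_grp)


# Hypothetical prevalences, NOT taken from the question's data.
P_RICH, P_POOR = 0.10, 0.20

# Full data as collected: 853 Rich vs. 100 Poor.
se_full = se_log_odds_ratio(853, 100, P_RICH, P_POOR)

# Reviewer's suggestion: subsample Rich down to 100.
se_subsampled = se_log_odds_ratio(100, 100, P_RICH, P_POOR)

# Same total N (953) split as evenly as possible.
se_balanced = se_log_odds_ratio(476, 477, P_RICH, P_POOR)

print(f"full data:        SE = {se_full:.3f}")
print(f"Rich subsampled:  SE = {se_subsampled:.3f}")
print(f"balanced, same N: SE = {se_balanced:.3f}")
```

Under these assumed prevalences the balanced design with the same total has the smallest standard error, the full unbalanced data come next, and the subsampled design has the largest. A smaller standard error means more power, so throwing away Rich observations makes the comparison worse, not better.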