I need to fit logistic regression models to a dataset where infection (present/absent) is my dependent variable and neighborhood (a factor with three levels: Rich, Poor, Very Poor) is my independent variable.
According to a reviewer who (like me) is not well versed in stats, one potential problem with my data is that the variable neighborhood has quite unevenly distributed sample sizes across its levels, such that:
Rich = 853
Poor = 100
Very Poor = 131
The reviewer suggested randomly subsetting the "Rich" group down to about 100 observations, in order to meet this alleged assumption of approximately equal sample sizes between groups within the same variable.
Because of the hypothesis behind our study, I need to set "Rich" as the reference category against which to compare the remaining two.
Is the reviewer's suggestion well founded? AFAIK, logistic regression makes no assumption that is violated when the categories of the independent variable are unbalanced, or even sparse, and none is violated even when it is the dependent variable that is unbalanced.
Best Answer
You are right that logistic regression does not make any assumptions about the distribution of your independent variable. The consequence of your situation is that you will have less power than if you had equal $n$s. However, reducing the $n$ in the "Rich" group will only lessen your power further. Rather, the idea is that if you had the same total $N$, but equally divided among the groups, you would have more power. Although written in a different context (viz., t-tests), you can get the general idea from my answer here: How should one interpret the comparison of means from different sample sizes?
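To make the power argument concrete, here is a minimal sketch in Python comparing the approximate standard error of the Poor vs. Rich log odds ratio under three designs: your actual sample sizes, the reviewer's subsample, and an equal split of the same total. With a single categorical predictor, the Wald standard error of a dummy coefficient equals the usual $\sqrt{1/a + 1/b + 1/c + 1/d}$ formula for a 2x2 table. Note that the infection prevalences (`P_RICH`, `P_POOR`) are made-up values for illustration only, not your data, and the helper `log_or_se` is a hypothetical name; expected cell counts are used in place of observed ones.

```python
import math

def log_or_se(n_ref, p_ref, n_cmp, p_cmp):
    """Approximate SE of the log odds ratio (comparison vs. reference group),
    computed from expected cell counts of the 2x2 table."""
    a = n_cmp * p_cmp          # infected, comparison group
    b = n_cmp * (1 - p_cmp)    # not infected, comparison group
    c = n_ref * p_ref          # infected, reference group
    d = n_ref * (1 - p_ref)    # not infected, reference group
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# ASSUMPTION: hypothetical prevalences, chosen only for illustration.
P_RICH, P_POOR = 0.10, 0.20

# Actual design: Rich = 853 (reference), Poor = 100.
se_full = log_or_se(853, P_RICH, 100, P_POOR)

# Reviewer's suggestion: subsample Rich down to 100.
se_subsampled = log_or_se(100, P_RICH, 100, P_POOR)

# Same total N (953) for these two groups, split almost evenly.
se_balanced = log_or_se(477, P_RICH, 476, P_POOR)

for label, se in [("full data", se_full),
                  ("Rich subsampled", se_subsampled),
                  ("balanced, same total N", se_balanced)]:
    print(f"{label:>24}: SE(log OR) = {se:.3f}")
```

Under these assumed prevalences, the standard error is smallest for the balanced design, larger for your actual data, and largest after subsampling. A larger standard error means a wider confidence interval and less power, which is why throwing away "Rich" observations hurts rather than helps.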