Solved – Sample size of the levels of a categorical variable

categorical data, regression, regression coefficients

Is there a generally acceptable sample size for the levels of a categorical variable included in a regression analysis? For example, if we have a variable color with 3 levels:

  • 5 reds
  • 140 blues
  • 155 greens

Could our regression coefficients be biased when comparing, say, reds to blues? Or would it be better to discard all records for red (or, where applicable, recode the variable)?

Best Answer

While not ideal (you would indeed prefer a more balanced color variable), it is not catastrophic. It would be more of a problem if your response variable were severely unbalanced (say, 5 positives vs. 295 negatives).

If you suspect complete separation in your dataset, or you get nonsensically large standard errors (Wald standard errors are commonly the first thing to fail in such situations), you might want to consider Firth regression. It is essentially a penalized variant of logistic regression; see the function logistf from the R package of the same name.

As for recoding: yes, if it makes sense to recode your variable so that there is no separate red level, that would be ideal. I would suggest it only if the recoding is reasonable in itself, though, and not just a trick to sweep some cumbersome observations under the rug.
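To put a rough number on "not ideal but not catastrophic", here is a minimal sketch. It assumes an ordinary linear regression with color as the only dummy-coded predictor, blue as the reference level, and a common residual standard deviation sigma. None of these assumptions comes from the question, and the helper name `se_of_contrast` is made up for illustration. Under them, each color coefficient is a difference of two group means, so its standard error is sigma * sqrt(1/n_a + 1/n_b).

```python
import math

# Level counts taken from the question.
counts = {"red": 5, "blue": 140, "green": 155}

def se_of_contrast(n_a, n_b, sigma=1.0):
    """Standard error of the difference between two group means,
    assuming a common residual standard deviation `sigma` (an assumption)."""
    return sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)

# Blue is assumed to be the reference level.
se_red_vs_blue = se_of_contrast(counts["red"], counts["blue"])
se_green_vs_blue = se_of_contrast(counts["green"], counts["blue"])

print(f"red vs blue:   {se_red_vs_blue:.3f} x sigma")
print(f"green vs blue: {se_green_vs_blue:.3f} x sigma")
print(f"ratio:         {se_red_vs_blue / se_green_vs_blue:.1f}")
```

Under these assumptions the red-vs-blue contrast is roughly four times less precise than green-vs-blue. The small red group therefore shows up as a wide confidence interval on the red coefficient rather than as a problem for the other estimates.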

(You do not mention convergence issues, so I assume your model can be estimated successfully.)
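If the response is binary, a quick pre-fit check for the separation scenario is to see whether any color level contains only one outcome class. With only 5 reds this can easily happen by chance, and the coefficient for such a level then has no finite maximum-likelihood estimate. The sketch below is only a simple sufficient check, not a full separation test. The function name `levels_with_single_outcome` and the outcome data are made up for illustration; only the level counts come from the question.

```python
from collections import defaultdict

def levels_with_single_outcome(levels, outcomes):
    """Return {level: outcome} for levels whose observations all share one outcome."""
    seen = defaultdict(set)
    for level, y in zip(levels, outcomes):
        seen[level].add(y)
    return {level: next(iter(ys)) for level, ys in seen.items() if len(ys) == 1}

# Hypothetical data: level counts from the question, outcomes invented.
levels = ["red"] * 5 + ["blue"] * 140 + ["green"] * 155
outcomes = [1] * 5 + [0, 1] * 70 + [0, 1, 1, 0, 1] * 31

flagged = levels_with_single_outcome(levels, outcomes)
print(flagged)  # {'red': 1}
```

A flagged level is a signal to look at penalized estimation such as the Firth regression mentioned in the answer, or at a sensible recoding, before trusting the Wald standard errors.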