R – How to Handle Models with Unequal Sample Size for Accurate Analysis

lme4-nlme, lsmeans, r, sample-size

I am currently running GLMMs in R to compare correct responses and response times between two groups (patients and controls), each composed of males and females.
I have three variables: Group and Sex (between-subject) and Shift (within-subject).
Here are my R models (the first for correct responses, the second for response times):

# Correct response (binary outcome): logistic GLMM
glm2A <- glmer(CR2 ~ GroupC * ShiftC * SexC + (1 + ShiftC || ID) + (1 | stim),
               data = ESTdfWO1,
               family = binomial(link = "logit"),
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)))
# Response time (positive, skewed outcome): inverse Gaussian GLMM
glmRT2 <- glmer(R2 ~ GroupC * ShiftC * SexC + (1 + ShiftC || ID) + (1 | stim),
                data = ESTdfWO2,
                family = inverse.gaussian(link = "identity"))

I have unequal sample sizes:

  • 53 males and 47 females in the patient group
  • 60 males and 115 females in the control group.

I have read that unequal sample sizes are not a problem for GLMMs, but I cannot find much on the topic (and one of my supervisors disagrees, though I think she is more used to ANOVA, where I know it would be a problem).
Mainly, I would like advice (and reading recommendations) on the following:

  1. Are unequal sample sizes a problem for GLMMs? I would like to find a paper that I can read and cite on this issue. Does anybody know of one?
  2. If yes: what type of statistical analysis should I use instead? Should I drop women from my control group (the advice of one of my supervisors)?
  3. If no: how do GLMMs account for unequal sample sizes? Are there assumptions to check for GLMMs (and if so, which ones)?
  4. Is it OK to use emmeans and pairs for post hoc tests (pairwise comparisons) with unequal sample sizes? If not, what should I use (in R)?

Thank you for your help.

Best Answer

  1. No, unequal sample sizes are not a problem.

I don't know if there is a canonical citation for the claim that unbalanced data are fine, but the following reference mentions it, and it comes from one of the authors of the nlme package.

Pinheiro, J. C. (2014). Linear mixed effects models for longitudinal data. Wiley StatsRef: Statistics Reference Online.
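
If it helps to see this rather than take it on trust, here is a minimal simulation sketch. Everything in it is made up for illustration (the group sizes of 100 and 175 subjects, 20 trials per subject, the effect sizes, and the variable names CR, Group and ID); it is not the asker's data or model. It simply fits a logistic GLMM to deliberately unbalanced groups. The model is fit by maximum likelihood on all available observations, so nothing has to be dropped, and the unequal group sizes show up in the standard errors rather than breaking the fit.

```r
library(lme4)

set.seed(1)

# Hypothetical unbalanced design: 100 patients, 175 controls, 20 trials each
n_subj  <- c(patient = 100, control = 175)
n_trial <- 20

subj <- data.frame(
  ID    = factor(seq_len(sum(n_subj))),
  Group = factor(rep(names(n_subj), times = n_subj),
                 levels = c("control", "patient")),
  u     = rnorm(sum(n_subj), sd = 0.8)  # simulated subject-level random intercepts
)

# Expand to one row per trial
d <- subj[rep(seq_len(nrow(subj)), each = n_trial), ]

# Assumed true model on the logit scale: intercept 1.0, patient effect -0.5
eta  <- 1.0 - 0.5 * (d$Group == "patient") + d$u
d$CR <- rbinom(nrow(d), size = 1, prob = plogis(eta))

# Random-intercept logistic GLMM fit to the unbalanced data
fit <- glmer(CR ~ Group + (1 | ID), data = d,
             family = binomial(link = "logit"))

# Subject counts per group (unequal by construction)
table(subj$Group)

# Fixed-effect estimates and their standard errors
coef(summary(fit))
```

Rerunning this with different seeds or group sizes is a quick way to see how the estimates and standard errors behave as the imbalance changes.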

  2. Not applicable, since the answer to 1 is no.

  3. Just check the regular assumptions of the GLMMs that you are using.

  4. I don't think unbalanced data are a problem for emmeans, and I have frequently seen it used on unbalanced data in psychology and cognitive science papers. Many studies are run on undergraduate students, and roughly two-thirds of students at North American universities are female, so this is a common issue. Also, one of the examples in the emmeans package documentation uses unbalanced data:

https://cran.r-project.org/web/packages/emmeans/vignettes/basics.html
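
As a rough illustration of what that looks like in practice, here is a sketch on simulated data. The cell sizes are the ones from the question, but the data, effect sizes, 20 trials per subject, and the simplified model (random intercept for ID only) are all assumptions made for the example. The last two calls show one thing worth knowing with unbalanced data: when averaging over a factor, emmeans gives each level equal weight by default, and weights = "proportional" weights by observed frequencies instead.

```r
library(lme4)
library(emmeans)

set.seed(2)

# Hypothetical subject table using the cell sizes from the question
cells <- data.frame(
  Group = c("patient", "patient", "control", "control"),
  Sex   = c("male", "female", "male", "female"),
  n     = c(53, 47, 60, 115)
)
subj <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("Group", "Sex")]
subj$Group <- factor(subj$Group)
subj$Sex   <- factor(subj$Sex)
subj$ID    <- factor(seq_len(nrow(subj)))
subj$u     <- rnorm(nrow(subj), sd = 0.8)  # simulated random intercepts

# 20 simulated trials per subject; effect sizes are arbitrary
d    <- subj[rep(seq_len(nrow(subj)), each = 20), ]
eta  <- 1 - 0.5 * (d$Group == "patient") + 0.3 * (d$Sex == "female") + d$u
d$CR <- rbinom(nrow(d), 1, plogis(eta))

fit <- glmer(CR ~ Group * Sex + (1 | ID), data = d, family = binomial)

# Estimated marginal means for Group within each Sex, on the probability scale
emm <- emmeans(fit, ~ Group | Sex, type = "response")
emm

# Pairwise control vs patient comparison within each sex
pairs(emm)

# Group means averaged over Sex: equal weights (default) vs observed frequencies
emmeans(fit, ~ Group, type = "response")
emmeans(fit, ~ Group, type = "response", weights = "proportional")
```

The same emmeans() and pairs() calls should carry over to a model like glm2A, with the factor names swapped for the ones in that model.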

Edit to include information about checking the assumptions of the GLMMs:

For logistic regression, there are not many assumptions to check. The main one will be multicollinearity, which can be checked using the vif() function from the car package. This website has some more information, and there are plenty of others:

http://www.sthda.com/english/articles/36-classification-methods-essentials/148-logistic-regression-assumptions-and-diagnostics-in-r/
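
For what it's worth, a minimal sketch of the vif() check is below. To keep it simple it uses a plain logistic regression with main effects only (no random effects, no interactions) on simulated data; the variable names mirror the question, but the data and effect sizes are invented, and whether this simplification is adequate for your full model is a judgment call.

```r
library(car)

set.seed(3)

# Hypothetical trial-level data with unbalanced Group and Sex
n <- 2000
d <- data.frame(
  Group = factor(sample(c("control", "patient"), n, replace = TRUE,
                        prob = c(0.64, 0.36))),
  Sex   = factor(sample(c("female", "male"), n, replace = TRUE,
                        prob = c(0.6, 0.4))),
  Shift = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Arbitrary effects on the logit scale
eta  <- 1 - 0.5 * (d$Group == "patient") - 0.3 * (d$Shift == "yes")
d$CR <- rbinom(n, 1, plogis(eta))

# Main-effects logistic regression
fit <- glm(CR ~ Group + Sex + Shift, data = d, family = binomial)

# Variance inflation factors: values close to 1 suggest little collinearity
v <- vif(fit)
v
```

Note that imbalance alone does not inflate these values; what matters is whether the predictors are correlated with each other.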

For the inverse Gaussian model, I have never run one myself, but the answer to this post links some useful sources.

https://stats.stackexchange.com/questions/422383/inverse-gaussian-glm-residual-deviance
