Solved – Rule of thumb for number of fixed effects in a mixed logistic regression to avoid overfitting

mixed-model, overfitting, sample-size

I'm looking for clarification on how many fixed effects parameters can be reasonably included in a mixed logistic regression without saturating or over-fitting the model. As a background, my dataset consists of roughly 40,000 telemetry locations from 20 individual caribou (~2,000 locations per individual), and I'm using a mixed logistic regression framework to compare the habitat conditions at used telemetry locations (1's) to randomly sampled 'available' locations (0's). I specify the individual as a random effect to account for the nested data structure, and my predictor variables are things like elevation, land cover, terrain ruggedness, distance to roads and water, etc.

I'm familiar with the convention suggesting a minimum of 10 observations per parameter for linear and logistic regressions WITHOUT random effects (referenced in http://www.sciencedirect.com/science/article/pii/S0895435615000141 and https://academic.oup.com/aje/article/165/6/710/63906/Relaxing-the-Rule-of-Ten-Events-per-Variable-in), but in my situation I'm confused about whether to treat individuals or telemetry locations as my 'observational unit'. There are some useful links in this similar question: Is there a general rule about max nr of variables to use in (generalized) linear model?, and in several other questions pertaining to sample size for linear and logistic models, but it is not clear whether or how the rules of thumb discussed apply to a nested situation.

This is an important consideration because, depending on how I calculate my sample size (i.e., based on total telemetry locations, telemetry locations per individual, or number of individuals), the maximum number of parameters advisable under the 10:1 rule of thumb varies drastically: from 4,000 (40,000 locations / 10) to 200 (2,000 locations per individual / 10) to 2 (20 individuals / 10). For reference, I would ideally include ~15 candidate variables in a full model. If I could only include 2 variables without overfitting, I fear it would be a very poor model indeed. In the literature in my field, similar situations treat the individual animal as the experimental unit; however, overfitting is rarely mentioned, and the number of parameters included in models is highly variable.
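The three candidate denominators, and the parameter budgets each implies under the 10:1 rule of thumb, can be checked with a few lines of arithmetic (Python here purely for illustration; the numbers are those from the question):

```python
# The three candidate "sample sizes" from the question and the maximum
# number of parameters each implies under the 10:1 rule of thumb.
total_locations = 40_000      # all telemetry points pooled
locations_per_animal = 2_000  # points per individual caribou
n_individuals = 20            # number of caribou (clusters)

for label, n in [("total locations", total_locations),
                 ("locations per individual", locations_per_animal),
                 ("individuals", n_individuals)]:
    max_params = n // 10      # 10 observations per parameter
    print(f"{label}: n = {n} -> at most {max_params} parameters")
```

The spread (4,000 vs 200 vs 2) is exactly why the choice of observational unit matters so much here.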

My questions boil down to these:

1. What is the most appropriate way to calculate sample size in a nested study design such as mine? The total number of locations, the number of locations per individual, or the number of individuals?

2. Which measure of sample size should be used in identifying the maximum number of parameters to include in a mixed model to avoid overfitting the model in a nested study design such as mine? Does the 10:1 rule of thumb still work in a mixed modeling framework?

Best Answer

I wouldn't worry so much about rules of thumb as about the number of data points that will be brought to bear on the parameters you want estimates for, and the variances of those estimates. For example, it sounds like you will have ample N to estimate the slopes of elevation, land cover, etc. (2,000 locations per caribou, and 40,000 overall, with slightly adjusted intercepts per caribou). But your estimate of the global intercept will likely be quite noisy, since you only have 20 caribou. That said, this is ultimately an empirical question; it's possible that you have little variance across caribou. Have you computed the intraclass correlation coefficient to gauge whether a mixed-effects model is even necessary?
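For a random-intercept logistic model, the ICC is usually computed on the latent (logit) scale, where the level-1 residual variance is fixed at π²/3. A minimal sketch; `sigma2_b` below is a hypothetical between-individual intercept variance you would read off a fitted model (e.g. `lme4::glmer` output), not a value estimated from this dataset:

```python
import math

# Latent-scale ICC for a random-intercept logistic mixed model:
#   ICC = sigma2_b / (sigma2_b + pi^2 / 3)
# sigma2_b is an assumed (hypothetical) random-intercept variance.
sigma2_b = 0.5
residual = math.pi ** 2 / 3   # logistic residual variance, ~3.29
icc = sigma2_b / (sigma2_b + residual)
print(f"latent-scale ICC = {icc:.3f}")
```

A near-zero ICC would suggest little caribou-level clustering, and the random effect would be doing little work.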

In general, it's hard to answer your question without knowing the model structure. Will you have random slopes for some predictors? What is the goal of the model? If it is merely a predictive model, then cross-validation or stepwise model selection techniques can give you an empirical answer regardless of any rules of thumb. If some models fail to converge, then there's your answer.
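The cross-validation suggested above should group folds by individual, so a model is always scored on caribou it never saw; otherwise within-animal correlation inflates the apparent performance. A sketch of that idea with simulated data; scikit-learn has no random effects, so a plain logistic regression stands in for the mixed model here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Simulated stand-in data: 20 animals, 200 locations each, 15 covariates.
rng = np.random.default_rng(0)
n_animals, per_animal, n_predictors = 20, 200, 15
n = n_animals * per_animal
X = rng.normal(size=(n, n_predictors))               # habitat covariates
groups = np.repeat(np.arange(n_animals), per_animal) # animal ID per row
intercepts = rng.normal(scale=0.5, size=n_animals)   # per-animal intercepts
logits = X[:, 0] - 0.5 * X[:, 1] + intercepts[groups]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))       # used (1) vs available (0)

cv = GroupKFold(n_splits=5)  # folds never split an animal across train/test
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=cv, scoring="roc_auc")
print(f"mean held-out AUC: {scores.mean():.3f}")
```

Comparing candidate models (e.g. 15 predictors vs a reduced set) on this held-out-animal AUC answers the overfitting question empirically, without appealing to a rule of thumb.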
