Mixed Model – Assigning ‘Site’ and ‘Time’ as Fixed or Random Factors in GLMM for Longitudinal Study

fixed-effects-modelgeneralized linear modelmixed modelpanel data

I want to explore the relationship between the abundance of an organism and several possible explanatory factors. I have doubts regarding what variables should be called as fixed or random in the GLMM.

I have a dataset with the number of snails in different sites within a national park (all sites are under the same climatic conditions). But there are local parameters whose effects over the snail abundance haven't been studied yet.

This is a longitudinal study, with repeated measures over time (every month, for almost two years). The number of snails were counted in the field at night, always in the same 21 sites (each site has a 6×6 square meters plot, delimitated with wooden stakes).

In case it could influence the analysis, note that some parameters may vary over time, such as the vegetation cover in each plot, or the presence of the snail natural predator (measured with yes/no values). Others, however, are always the same, because they are specific to each site, such as the distant to the nearest riverbed or the type of soil.

Here is a subset of my data:

> snail.data
   site time snails vegetation_cover predator type_soil distant_riverbed
1     1    1      9               NA        n         1               13
2     1    2      7              0.8        n         1               13
3     1    3     13              1.4        n         1               13
4     1    4     14              0.6        n         1               13
5     1    5     12              1.6        n         1               13
10    2    1      0               NA        n         1              136
11    2    2      0              0.0        n         1              136
12    2    3      0              0.0        n         1              136
13    2    4      0              0.0        n         1              136
14    2    5      0              0.0        n         1              136
19    3    1      1               NA        n         2              201
20    3    2      0              0.0        n         2              201
21    3    3      0              0.0        y         2              201
22    3    4      3              0.0        n         2              201
23    3    5      2              0.0        n         2              201
28    4    1      0               NA        n         2              104
29    4    2      0              0.0        n         2              104
30    4    3      0              0.0        y         2              104
31    4    4      0              0.0        n         2              104
32    4    5      0              0.0        n         2              104
37    5    1      1               NA        n         3               65
38    5    2      0              2.4        n         3               65
39    5    3      3              2.2        n         3               65
40    5    4      2              2.2        n         3               65
41    5    5      4              2.0        y         3               65
46    6    1      1               NA        n         3               78
47    6    2      2              3.0        n         3               78
48    6    3      7              2.8        n         3               78
49    6    4      3              1.8        n         3               78
50    6    5      6              1.2        y         3               78
55    7    1     14               NA        n         3               91
56    7    2     21              2.8        n         3               91
57    7    3     16              2.6        n         3               91
58    7    4     15              1.6        n         3               91
59    7    5      8              2.0        n         3               91

So I'm interested in investigating if the number of snails is significantly different in each site and if those differences are related to some specific parameter.

So far the best statistical approach I have found is a generalized linear mixed model. But I'm struggling in choosing the correct fixed and random variables. My reasoning is, although I'm checking for the differences among sites (by comparing the number of snails) the focus of the study is the other parameters commented above, thus the site would be a random factor.

Then, my question is: should 'site' and 'time' be considered random factors and the local parameters should be the fixed variables? And should I include interactions between time and other parameters? Since climatic conditions are equal for all sites at the same time, I ruled out the possibility of interactions between time and site, is this correct?

I have set up my command as follows in R:

library(lme4)
mixed_model <- glmer(snails ~ vegetation_cover + predator + type_soil + distant_riverbed + (1|site) + (1|time), data = snails.data, family = poisson)

Would it be the correct syntax for what I have described?

I have read a couple of basic tutorials and other related posts (Generalized Mixed Model with repeated measurements, Fitting a Poisson GLM mixed model with a random slope and intercept, Interactions between random effects) that were helpful to better undertand the mixed effect models, but the context and design of the field surveys are different.

Best Answer

These choices probably have more to do with your understanding of the subject matter than anything else. First, you should think in detail about what you ask:

I'm interested in investigating if the number of snails is significantly different in each site and if those differences are related to some specific parameter.

That's not quite the question that your mixed model addresses. The fixed effects certainly would help determine site characteristics associated with snail numbers. But the random intercept for site represents differences among sites beyond those from the fixed effects. Each random intercept represents a hypothetical site-associated difference from an overall baseline intercept, estimated at reference values of all your fixed-effect terms; the distribution of those random intercepts is forced to be Gaussian with mean 0.

Random-effect terms are a well established way to take into account site-specific associations with outcome that aren't "due to" the fixed effects or lead to a lack of independence in outcomes measured at the same site. You might interpret the variance in random effects for site in terms of what isn't associated with the fixed effects. But that's not how you phrased your question.

You almost certainly shouldn't incorporate site as a fixed effect in a model that includes the major site-related predictors as fixed effects, as that will again model the site effects beyond those due to the other fixed effects and use up a lot more degrees of freedom than a random-effect term in the process.

There's another general way to account for lack of independence arising from repeated measures: generalized estimating equations. In effect, that approach gives you the same point estimates for fixed effects as you get by ignoring the repeated measures but adjusts the covariances of those estimates to take the repeated measures into account. The R geepack package extends that approach to generalized linear models like your Poisson model.

The handling of time depends even more on your understanding of the subject matter. If there really is "a more or less stable number of detected snails over time" as you say in a comment, then a random effect for time could be a good way to go, allowing for correlations within each time period and assessing the magnitude of random deviations over time.

If there are trends in snail numbers over time, however, you would want to model them directly. There's also the possibility of sequential correlations of snail numbers over time. For example, if reproductive cycles and natural turnover rates are of the same scale as your time periods, larger numbers in one period might be associated with larger numbers one or more time periods later. Then you might need to use more formal time-series analysis to handle time properly and remove those correlations. With only 2 years of data it might be tricky to do detailed seasonal time-series modeling as described on this page for a Poisson model.

So your best bet for handling time might be to combine a look at your data with what you know about the natural history of snails, and proceed accordingly.

Related Solutions

Poisson GLM – Fitting Mixed Model with Random Slope and Intercept

In R:

library(lme4)
lmer(counts ~ dependent_variable + (1+time|ID), family="poisson")

In this case $Y_i \sim {\rm Poisson}(\lambda_i)$ and this code fits the model

$$ \log(\lambda_i) = \beta_0 + \beta_1 X_i + \eta_{i1} + \eta_{i2}t $$

where $X_i$ is dependent_variable, $t$ is time and $i$ is ID. $\beta_0, \beta_1$ are fixed effects and $\eta_{i1}, \eta_{i2}$ are random effects whose variances are estimated by the model.

Here's an example with some quickly simulated data where the random effect variances truly are 0, the covariate has no effect, every outcome is ${\rm Poisson}(1)$, and each individual is seen 10 times at times $t = 1,...,10$.

x = rnorm(100)
t = rep(1:10,each=10)
ID = rep(1:10,10)
y = rpois(100,1)
g <- lmer(y ~ x + (1+t|ID), family="poisson")
summary(g)
Generalized linear mixed model fit by the Laplace approximation 
Formula: y ~ x + (1 + t | ID) 
   AIC   BIC logLik deviance
 108.8 121.9 -49.42    98.85
Random effects:
 Groups Name        Variance  Std.Dev. Corr   
 ID     (Intercept) 0.0285038 0.168831        
        t           0.0027741 0.052669 -1.000 
Number of obs: 100, groups: ID, 10

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.09078    0.11808  -0.769    0.442
x            0.13670    0.08845   1.546    0.122

Correlation of Fixed Effects:
  (Intr)
x -0.127

One point of caution - the Std.Dev. column is just the square root of the Variance column, NOT the standard error of the variance estimate!

Solved – R: GLMM for unbalanced zero-inflated data (glmmTMB)

A1: "All in all, I have about 33% of the dates having counts of zero, which makes me think the data is zero inflated." -> this is a common misconception - zero-inflation != lots of zeros. Zero-inflation means you have more zeros than you would expect, given your fitted model. Without having fit a model, you can't know what to expect. The DHARMa R package (disclaimer: I develop this package) has a zero-inflation test for GLMMs, including glmmTMB, that you can use to test your model. However, see notes in the vignette about zero-inflation: when fitting GLMMs with variable dispersion, zero-inflation often shows up as underdispersion, so the most reliable test is usually to run a model selection with ZIP against standard model.

A2: When running a Poisson GLMM with count data, you absolutely have to check for overdispersion!!! Fitting a poisson without check is a big no no. It would be very uncommon that your data is not overdispersed, so your poisson is likely not appropriate and you should move to a neg binom or an poisson with OLRE. DHARMa has dispersion test that works with glmmTMB.

A3: at the end of the DHARMa vignette, there is an example for analysing and checking zero-inflated count data (Owl dataset)

B/C makes sense

D/E Not generally a problem, but especially in this case you should put an RE on locality as well (nested location/site)

Best Answer

Related Solutions

Poisson GLM – Fitting Mixed Model with Random Slope and Intercept

Solved – R: GLMM for unbalanced zero-inflated data (glmmTMB)

Related Question