Mixed Models – Nested Crossed Random Effects for Repeated Measures Data in R

lme4-nlme, mixed-model, multilevel-analysis, r

Problem

There are two excellent CV posts on specifying crossed effects models (post 1, post 2).

The issue I'm trying to wrestle with pertains to part of the answer to post 2, in particular how to nest crossed random effects.

In my study, I have:

About 20 individuals per site
About 10 sites
Within each site, there were about 20 samples

The outcome in the example is participant's "interest" (the study is about out-of-school programs).

Because there are dependencies by both participant and sample, I think there are two crossed random effects: one for the observations associated with each individual, and one for the observations associated with each sample. The hard part for me is that both of these random effects are nested within one of the 10 sites (programs).

The samples were at the same time for all of the individuals within the site, but at different times at different sites, so that sample 1 in site A was not necessarily at the same time in any sense (not the same date / time nor at the same interval from the "start" of the site's activities). Therefore, to create the variable identifying the time of the sample, I combined the site variable, the date that the sample was collected, and another variable specifying whether the sample was the 1st, 2nd, 3rd, or 4th sample collected for that date. It's a factor.
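
As a minimal sketch of how such an identifier could be built: the column names `date` and `sample_number` below are assumptions for illustration (only `site` and the combined `sample` appear in the data shown), and the rows are made up.

```r
# Hypothetical raw columns; `date` and `sample_number` are assumed names.
raw <- data.frame(
  site          = c("1", "1", "1", "2", "2"),
  date          = c("2015-07-14", "2015-07-14", "2015-07-15",
                    "2015-07-14", "2015-07-14"),
  sample_number = c(1, 2, 1, 1, 2)
)

# Paste the three pieces together so that every label is unique across sites.
raw$sample <- factor(paste(raw$site, raw$date, raw$sample_number, sep = "-"))

levels(raw$sample)
```

Because the site is part of every label, the first sample at site 1 can never be confused with the first sample at site 2.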

The data (in R) are as follows:

# A tibble: 2,970 × 4
    interest participant_ID  site           sample
   <dbl+lbl>          <dbl> <chr>         <fctr>
1          2           1001     1 1-2015-07-14-1
2          2           1001     1 1-2015-07-14-2
3          4           1001     1 1-2015-07-15-1
4          3           1001     1 1-2015-07-15-2
5          3           1001     1 1-2015-07-21-1
6          1           1001     1 1-2015-07-21-2
7          3           1001     1 1-2015-07-21-4
8          3           1001     1 1-2015-07-22-1
9          4           1001     1 1-2015-07-22-4
10         3           1001     1 1-2015-07-28-1
# ... with 2,960 more rows

Possible Solution

In the answer to post 2, the author of the selected answer wrote:

Because you do not have unique values of the tow variable (i.e.
because as you say below tows are specified as 1, 2, 3 at every
station), you do need to specify the nesting, as (1|station:tow:day).
If you did have the tows specified uniquely, you could use either
(1|tow:day) or (1|station:tow:day) (they should give equivalent
answers).

Mapping this to my example: because I do have unique values of the sample variable (the analogue of tow), I should not need to specify the nesting. However, I'm having trouble specifying this model mathematically and, thus, in terms of model syntax. (I am using lme4 in R.)

But, here seem to be the options:

Not nesting the crossed random effects within the site because the sample variable includes a site identifier:

lmer(interest ~ 1 + (1|participant_ID) + (1|sample), data = df)

Creating the sample variable without a site identifier, but in a way that still identifies samples uniquely within each site, and nesting the crossed random effects within the site:

lmer(interest ~ 1 + (1|site/participant_ID) + (1|site/sample), data = df)

Other examples interact the crossed random effects, via adding a term such as (1|participant_ID:sample).

Does either of these seem like they would account for dependencies by both participant and sample? Or, are there other options or better ways to model this?

Best Answer

Here's my read on what experiment was done and how I would model it.

Each observation has three identifiers and one value.

You stated that these identifiers are participant, site, and sample and the value is interest.

You explained to me that there are many levels of each of these identifiers and you expect measurements that have a common value of any one of them (same participant, same site or same sample) are likely to be more similar than observations that have none in common.

This sounds like a perfect situation for an LMM with a random intercept for each of those factors. Thus, the model I would fit is:

lmer(formula = interest ~ (1|participant_ID) + (1|site) + (1|sample),
     data = df)
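
To see what this model does on data with the same structure, here is a sketch on simulated data. Everything about the simulation (group sizes, variances, the helper columns `p` and `s`) is made up for illustration; only the formula comes from the model above.

```r
library(lme4)

set.seed(42)

# Simulated stand-in for the real data (all sizes and variances are made up):
# 10 sites, 6 participants and 5 samples per site, and every participant
# is observed at every sample of their own site.
sim <- expand.grid(p = 1:6, s = 1:5, site = 1:10)
sim$participant_ID <- factor(paste(sim$site, sim$p, sep = "-"))
sim$sample         <- factor(paste(sim$site, sim$s, sep = "-"))
sim$site           <- factor(sim$site)

# Outcome = intercept + site effect + participant effect + sample effect + noise.
sim$interest <- 3 +
  rnorm(nlevels(sim$site))[as.integer(sim$site)] +
  rnorm(nlevels(sim$participant_ID))[as.integer(sim$participant_ID)] +
  rnorm(nlevels(sim$sample))[as.integer(sim$sample)] +
  rnorm(nrow(sim))

fit <- lmer(interest ~ 1 + (1 | participant_ID) + (1 | site) + (1 | sample),
            data = sim)

# Number of levels lme4 found for each grouping factor.
ngrps(fit)
```

Because the participant and sample labels are unique across sites, no nesting syntax is needed: the group counts reported by `ngrps()` should match the number of participants, samples and sites in the design.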

EDIT: deleted misunderstandings.

Tow nested within station when tow is random and station is fixed

station+(1|station:tow) is correct. As @John said in his answer, (1|station/tow) would expand to (1|station)+(1|station:tow) (main effect of station plus interaction between tow and station), which you don't want because you have already specified station as a fixed effect.
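
One way to check this expansion yourself is `findbars()` from lme4, which extracts the random-effects terms of a formula and expands the slash operator. This is only a sketch; `y` is a placeholder response name, and no data are needed because the formula is only parsed, not fitted.

```r
library(lme4)

# findbars() returns the random-effects terms with `/` already expanded.
re_terms <- findbars(y ~ station + (1 | station/tow))

# Print each term as text.
vapply(re_terms, function(term) paste(deparse(term), collapse = ""),
       character(1))
```

The nested term comes back as two separate terms, one for station alone and one for the station-by-tow combination.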

Interaction between station and day when station is fixed and day is random.

The interaction between a fixed and a random effect is always random. Again as @John said, station*day expands to station+day+station:day, which you (again) don't want because you've already specified day in your model. I don't think there is a way to do what you want and collapse the crossed effects of day (random) and station (fixed), but you could if you wanted write station+(1|day/station), which as specified in the previous answer would expand to station + (1|day) + (1|day:station).

Interaction between tow and day when tow is nested in station

Because you do not have unique values of the tow variable (i.e. because, as you say below, tows are specified as 1, 2, 3 at every station), you do need to specify the nesting, as (1|station:tow:day). If you did have the tows specified uniquely, you could use either (1|tow:day) or (1|station:tow:day) (they should give equivalent answers). If you do not specify the nesting in this case, lme4 will try to estimate a random effect that is shared by tow #1 at all stations ...

One way to diagnose whether you've specified the random effects correctly is to look at the number of groups reported for each grouping variable and see whether it agrees with what you expect (for example, the station:tow:day group should have a number of levels corresponding to the total number of station $\times$ tow $\times$ day combinations; if you forgot the nesting with station, you should see fewer groups than you ought to).
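
The same check can be done before fitting anything, by counting the distinct combinations directly. The layout below is hypothetical (3 stations, 2 days, tows labelled 1 to 3 at every station) and is only meant to show the difference in counts.

```r
# Hypothetical survey layout: tows are labelled 1-3 at every station.
d <- expand.grid(station = c("A", "B", "C"), day = 1:2, tow = 1:3)

# Groups seen by (1 | tow:day): station is ignored, so tow 1 at
# station A is pooled with tow 1 at stations B and C.
n_without <- nlevels(interaction(d$tow, d$day, drop = TRUE))

# Groups seen by (1 | station:tow:day): one group per real tow.
n_with <- nlevels(interaction(d$station, d$tow, d$day, drop = TRUE))

c(without_station = n_without, with_station = n_with)
```

Here the term without station sees 6 groups, while the correctly nested term sees all 18.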

Are http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-specification and http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#nested-or-crossed useful to you?

R – Using Nested Random Effects in LME4

Let me try to answer Q3 first:

Is it justifiable to break up the models into two? I don't think so. Instead of finding a way around the message "fixed-effect model matrix is rank deficient so dropping 1 column / coefficient", we should try to figure out why it appears. Usually, it means that you do not have enough information to estimate the specified model.

I created an example that matches your experimental design above:

set.seed(23)
## YOUR DESIGN AS DEPICTED IN THE IMAGE
surv <- data.frame(
  Gradient = rep(c("In", "Out"), each = 30),
  Site = rep(paste("Site", c(1:4), sep = ""), each = 15),
  Transect = rep(paste("T", c(1:12), sep = ""), each = 5),
  Plate = factor(sample(c(1:60))),
  # GET SOME RANDOM NUMBERS FROM POISSON DISTRIBUTION
  Response = rpois(60, 6)
)

Now, using your model a as an example, we can reproduce the message "fixed-effect model matrix is rank deficient so dropping 1 column / coefficient" (that is also the case for your models b, c, d, and e):

library(lme4)
model_a <- glmer(Response ~ Gradient + Site + (1|Site),
                 data = surv, family = "poisson", nAGQ = 7)

When you look at the fixed-effects output (summary(model_a); also check model.matrix(model_a)), you can see that Site4 was dropped. This most likely has to do with the fact that your fixed effects (factors and levels) cannot be represented as unique combinations of each other. In other words, each level of Site occurs in only one level of Gradient, so Gradient is completely determined by Site. This is nothing new, since you already specified that Site is nested within Gradient. It also means that you cannot test whether levels of Site are significantly different. Instead, you would want to quantify and account for the (random) effect of Site (and Transect) on the fixed effect Gradient.
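
The rank deficiency can also be seen without fitting a model at all, using base R only. The sketch below rebuilds just the Gradient and Site columns of the design (the name `design_fixed` is introduced here for illustration) and compares the number of fixed-effect columns with the rank of the model matrix.

```r
# Same layout as `surv`: 4 sites, 2 inside each Gradient level.
design_fixed <- data.frame(
  Gradient = rep(c("In", "Out"), each = 30),
  Site     = rep(paste0("Site", 1:4), each = 15)
)

X <- model.matrix(~ Gradient + Site, data = design_fixed)

n_cols <- ncol(X)      # fixed-effect columns requested by the formula
x_rank <- qr(X)$rank   # linearly independent columns actually available

c(columns = n_cols, rank = x_rank)
```

The formula asks for five columns but the matrix only has rank four, because the GradientOut column is the sum of the Site3 and Site4 columns. That one redundant column is what gets dropped.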

Here's how I would specify the model:

model_b <- glmer(Response ~ Gradient + (1|Site) + (1|Transect), 
                 data=surv, family = "poisson")
summary(model_b)

[showing only relevant output]

Random effects:
 Groups   Name        Variance Std.Dev.
 Transect (Intercept) 0        0       
 Site     (Intercept) 0        0       
Number of obs: 60, groups:  Transect, 12; Site, 4

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.81916    0.07352  24.743   <2e-16 ***
GradientOut -0.04987    0.10530  -0.474    0.636    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

When you look at the part of the output where it says Number of obs: 60, groups: Transect, 12; Site, 4, you can see that the grouping matches your design. Also, since your nesting is explicit (every transect has a unique label), you don't need to specify nesting in your random-effects statement (using the / operator) because it is implied by your data. The variance estimates of the random effects will probably change when you enter your real data.
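
Whether nesting really is implied by the labels can be checked with a cross-tabulation. This is a base R sketch on the same Site and Transect layout as above; the name `design_nest` is introduced here for illustration.

```r
# Same Site / Transect layout as `surv`.
design_nest <- data.frame(
  Site     = rep(paste0("Site", 1:4), each = 15),
  Transect = rep(paste0("T", 1:12), each = 5)
)

# A transect is nested in Site if it occurs in exactly one site.
tab <- table(design_nest$Transect, design_nest$Site)
is_nested <- all(rowSums(tab > 0) == 1)

# With unique labels, Site:Transect defines the same groups as Transect,
# so (1 | Site/Transect) and (1 | Site) + (1 | Transect) describe the
# same grouping structure.
same_groups <- nlevels(interaction(design_nest$Site, design_nest$Transect,
                                   drop = TRUE)) ==
  nlevels(factor(design_nest$Transect))

c(is_nested = is_nested, same_groups = same_groups)
```

If transects had been labelled T1 to T3 within every site instead, the first check would fail and the nesting would have to be written out explicitly.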

As for Q1, I think that Plate itself isn't a level since it has no replicates and is also not repeatedly measured over time. The interaction of Transect:Plate would be the lowest level in your case, which is represented by (1|Transect) due to the explicit nesting structure in your data.

As for Q2, since Site seems to be of interest to you (included as fixed effect in your models), it could be treated as such, if you reduce its levels from four to two and replicate it in the Gradient levels In and Out. However, I don't know whether this is possible because I don't know enough about your Site factor. If that's reasonable, you could do this:

surv2 <- data.frame(
  Gradient = rep(c("In","Out"), each = 30),
  # TWO SITES, EACH REPLICATED IN BOTH GRADIENT LEVELS
  Site = rep(rep(paste("Site", c(1:2), sep = ""), each = 15), times = 2),
  Transect = rep(paste("T", c(1:12), sep = ""), each = 5),
  Plate = factor(sample(c(1:60))),
  # GET SOME RANDOM NUMBERS FROM POISSON DISTRIBUTION
  Response = rpois(60, 6)
)

model_c <- glmer(Response ~ Gradient + Site  + (1|Transect), 
                 data = surv2, family = "poisson")
summary(model_c)

and potentially also allowing for random slopes between levels of Site:

model_d <- glmer(Response ~ Gradient + Site  + (Site|Transect), 
                 data = surv2, family = "poisson")
summary(model_d)

Note: This no longer gives you the "fixed-effect model matrix is rank deficient so dropping 1 column / coefficient" message.

I don't think that you need to choose between models by comparing AICs in this case. I think the most useful model given the data is model_b, and perhaps model_c or model_d if redefining Site as described above is feasible. But maybe someone else can comment if that's incorrect.