Solved – Nested linear mixed-effects model

mixed modelnested datar

I have 9 sites. Within each site, plant life was sampled to represent 70% of the basal area. Of the sampled plants, I know the corresponding family, genus, and species. For this project, I extracted leaf waxes and determined the total wax loading (TWL), average chain length (ACL), and the carbon preference index (CPI). My goal is to determine how much of the observed variation in TWL, ACL, and CPI, is explained by either the site, family, genus, or species. This would allow me to determine the major factors affecting the chemical composition of plant leaf waxes. Please note that most species, genera, and families occur several times within the data. The data look like this:

     SITE              FAM           GEN                        SPEC     TWL   CPI   ACL
1   TAM05         Fabaceae     Tachigali        Tachigali polyphylla   34.65  6.89 30.06
2   TAM05    Caryocaraceae   Anthodiscus        Anthodiscus peruanus   68.68 13.00 30.57
3   TAM05         Moraceae      Clarisia           Clarisia racemosa   46.85 12.22 29.68
4   TAM05    Lecythidaceae  Bertholletia        Bertholletia excelsa   69.64  6.27 30.52
5   TAM05         Moraceae  Pseudolmedia      Pseudolmedia laevigata  126.33 17.34 29.61
6   TAM05         Moraceae  Pseudolmedia         Pseudolmedia laevis   83.58 13.50 30.07
7   TAM05         Linaceae     Roucheria          Roucheria punctata  160.62 13.98 29.71
8   TAM05         Fabaceae    Cedrelinga     Cedrelinga cateniformis  151.12 10.82 30.17
9   TAM05       Urticaceae      Pourouma              Pourouma minor   61.47  1.41 29.64
10  TAM05         Fabaceae  Sclerolobium     Sclerolobium bracteosum  163.28 12.22 29.53
...

By reviewing similar studies, I eventually ended up with the use of a nested linear mixed-effects model in R. My analysis here would likely consist of running the same analysis three times, once for every variable (TWL, CPI, ACL).

I am, however, unsure of how to proceed. I am currently at the point where I believe the formulas to use would be:

TWL1 <- lmer(TWL ~ SITE + (1|FAM/GEN/SPEC))
CPI1 <- lmer(CPI ~ SITE + (1|FAM/GEN/SPEC))
ACL1 <- lmer(ACL ~ SITE + (1|FAM/GEN/SPEC))

Is this correct? And is it correct that I would have to run reduced models as well, and compare them with a chi-squared test?

Best Answer

I think the models you wrote are not incorrect, although I do wonder why you chose to treat Sites as fixed effects rather than random effects. There is nothing special about these sites, right? For example, you don't care about any differences among these particular 9 sites? If not, they are probably best considered random. 9 is not a lot of levels for a random factor, but hey, I'm sure it's expensive to get a lot of different sites.

Since FAM, GEN, and SPEC are explicitly nested in your dataset (e.g., each FAM has its own unique set of GEN labels associated with it), another way to write your models would be:

TWL1 <- lmer(TWL ~ SITE + (1|FAM)+(1|GEN)+(1|SPEC))
CPI1 <- lmer(CPI ~ SITE + (1|FAM)+(1|GEN)+(1|SPEC))
ACL1 <- lmer(ACL ~ SITE + (1|FAM)+(1|GEN)+(1|SPEC))

Although, as I hinted above, the models where all effects are random might make more sense:

TWL1 <- lmer(TWL ~ (1|SITE)+(1|FAM)+(1|GEN)+(1|SPEC))
CPI1 <- lmer(CPI ~ (1|SITE)+(1|FAM)+(1|GEN)+(1|SPEC))
ACL1 <- lmer(ACL ~ (1|SITE)+(1|FAM)+(1|GEN)+(1|SPEC))

Or even possibly the models where sites are random and the plant categories are fixed? Not sure about these but they seem at least not-obviously-crazy to me:

TWL1 <- lmer(TWL ~ FAM + GEN + SPEC + (1|SITE))
CPI1 <- lmer(CPI ~ FAM + GEN + SPEC + (1|SITE))
ACL1 <- lmer(ACL ~ FAM + GEN + SPEC + (1|SITE))

I'm not totally sure what "70% of the basal area" means, but if it implies that future replications of the study would most likely end up with the same set of plant categories (although obviously different individual plants), then maybe this last specification is defensible. But I leave that to your scientific judgment.

As for whether you want to compare models using likelihood ratio tests, it really just depends on what you are wanting to know. If your goal is to talk about proportions of variation due to each of the effects in your study, the models with all random effects would probably be easiest because you can compute those proportions simply by taking the ratio of each variance component over the sum of all the variance components.

Related Question