I have 9 sites. Within each site, plant life was sampled to represent 70% of the basal area. Of the sampled plants, I know the corresponding family, genus, and species. For this project, I extracted leaf waxes and determined the total wax loading (TWL), average chain length (ACL), and the carbon preference index (CPI). My goal is to determine how much of the observed variation in TWL, ACL, and CPI, is explained by either the site, family, genus, or species. This would allow me to determine the major factors affecting the chemical composition of plant leaf waxes. Please note that most species, genera, and families occur several times within the data. The data look like this:
SITE FAM GEN SPEC TWL CPI ACL
1 TAM05 Fabaceae Tachigali Tachigali polyphylla 34.65 6.89 30.06
2 TAM05 Caryocaraceae Anthodiscus Anthodiscus peruanus 68.68 13.00 30.57
3 TAM05 Moraceae Clarisia Clarisia racemosa 46.85 12.22 29.68
4 TAM05 Lecythidaceae Bertholletia Bertholletia excelsa 69.64 6.27 30.52
5 TAM05 Moraceae Pseudolmedia Pseudolmedia laevigata 126.33 17.34 29.61
6 TAM05 Moraceae Pseudolmedia Pseudolmedia laevis 83.58 13.50 30.07
7 TAM05 Linaceae Roucheria Roucheria punctata 160.62 13.98 29.71
8 TAM05 Fabaceae Cedrelinga Cedrelinga cateniformis 151.12 10.82 30.17
9 TAM05 Urticaceae Pourouma Pourouma minor 61.47 1.41 29.64
10 TAM05 Fabaceae Sclerolobium Sclerolobium bracteosum 163.28 12.22 29.53
...
By reviewing similar studies, I eventually ended up with the use of a nested linear mixed-effects model in R. My analysis here would likely consist of running the same analysis three times, once for every variable (TWL, CPI, ACL).
I am, however, unsure of how to proceed. I am currently at the point where I believe the formulas to use would be:
TWL1 <- lmer(TWL ~ SITE + (1|FAM/GEN/SPEC))
CPI1 <- lmer(CPI ~ SITE + (1|FAM/GEN/SPEC))
ACL1 <- lmer(ACL ~ SITE + (1|FAM/GEN/SPEC))
Is this correct? And is it correct that I would have to run reduced models as well, and compare them with a chi-squared test?
Best Answer
I think the models you wrote are not incorrect, although I do wonder why you chose to treat Sites as fixed effects rather than random effects. There is nothing special about these sites, right? For example, you don't care about any differences among these particular 9 sites? If not, they are probably best considered random. 9 is not a lot of levels for a random factor, but hey, I'm sure it's expensive to get a lot of different sites.
Since FAM, GEN, and SPEC are explicitly nested in your dataset (e.g., each FAM has its own unique set of GEN labels associated with it), another way to write your models would be:
Although, as I hinted above, the models where all effects are random might make more sense:
Or even possibly the models where sites are random and the plant categories are fixed? Not sure about these but they seem at least not-obviously-crazy to me:
I'm not totally sure what "70% of the basal area" means, but if it implies that future replications of the study would most likely end up with the same set of plant categories (although obviously different individual plants), then maybe this last specification is defensible. But I leave that to your scientific judgment.
As for whether you want to compare models using likelihood ratio tests, it really just depends on what you are wanting to know. If your goal is to talk about proportions of variation due to each of the effects in your study, the models with all random effects would probably be easiest because you can compute those proportions simply by taking the ratio of each variance component over the sum of all the variance components.