Solved – When the dependent variable and random effects ‘overlap’ in mixed effects models

lme4-nlme, mixed model, phylogeny, r

I have added a new example here for clarity, see original question below

E.g. I have 10 schools in 5 countries, and ten students from each school are sampled.

Predictor variables: student test marks for Language, Math and Science
Response variable: school fee

I want to know which subject (e.g. Math) correlates with the school fees.

lmer(fees~math+language+science+(1|country/school))   (each row is a student)

But now I have the same fees for students within the same school, and school is added as a random effect. Is this allowed? Should I just take the average subject marks per school and drop the school random effect? See original question below

I have a dependent variable that depends on one of my random effects, as such:

Dep   R1   R2   X1   X2   X3
30    a    g    4    43   21
30    a    g    7    46   18
20    b    g    5    31   22
20    b    g    4    37   17
60    c    h    9    50   26
60    c    h    7    34   21

lmer(Dep~X1+X2+X3+(1|R2/R1))   (R2=Genus, R1=Species)

I need the random effect, as I have independent data for each specimen, but I know this setup cannot be correct. Plus some of my models fail to converge. I can use the average values of traits for each R1 and then drop the R1 random effect, but then I lose lots of data.

Can I use a linear mixed effects model for this, or should I be using another technique?

I have since decided to use a phylogeny with a PGLS, because taxonomic level random effects are too rough.

At the moment I am looking into pgls.Ives in phytools to account for within species level variation (see Helmus, M. R., Bland, T. J., Williams, C. K., & Ives, A. R. (2007). Phylogenetic measures of biodiversity. The American Naturalist, 169(3)).

Best Answer

I appreciate the school example, but for simplicity I stay with the original example, which was:

lmer(Dep~X1+X2+X3+(1|R2/R1)) (R2=Genus, R1=Species)

You make two comments:

1. I can use the average values of traits for each R1 and then drop the R1 random effect, but then I lose lots of data
2. Response variable has no variation within species

So, within each group of R1, despite variation in the fixed effects, there is no difference in the response. This may or may not be the reason why you get identifiability problems. In any case, you have a very high chance of wrongly attributing variation in the response to either the fixed or the random effects.
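A quick way to confirm that the response really is constant within each R1 group is to compute its within-group variance. This is only a minimal sketch in base R: the data frame `d` is typed in from the six rows shown in the question, and the names `within_var` and `n_distinct` are mine, not from the original post.

```r
# Toy data copied from the question's table (R1 = Species, R2 = Genus)
d <- data.frame(
  Dep = c(30, 30, 20, 20, 60, 60),
  R1  = c("a", "a", "b", "b", "c", "c"),
  R2  = c("g", "g", "g", "g", "h", "h"),
  X1  = c(4, 7, 5, 4, 9, 7),
  X2  = c(43, 46, 31, 37, 50, 34),
  X3  = c(21, 18, 22, 17, 26, 21)
)

# Variance of the response within each species
within_var <- tapply(d$Dep, d$R1, var)
print(within_var)

# Number of distinct response values within each species
n_distinct <- tapply(d$Dep, d$R1, function(v) length(unique(v)))
print(n_distinct)
```

If every within-group variance is zero, the rows inside a species carry no extra information about the response, whatever the predictors do.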

To solve this issue, I would probably go with your comment 1 after all, i.e. averaging the trait values. If the response doesn't change, there is nothing to be learned from the within-species variability, so you are not losing information.
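Averaging to one row per species could look roughly like the following. Again a sketch in base R under my own assumptions: `d` is the toy table from the question and `sp` is a name I made up for the species-level data.

```r
# Toy data copied from the question's table (R1 = Species, R2 = Genus)
d <- data.frame(
  Dep = c(30, 30, 20, 20, 60, 60),
  R1  = c("a", "a", "b", "b", "c", "c"),
  R2  = c("g", "g", "g", "g", "h", "h"),
  X1  = c(4, 7, 5, 4, 9, 7),
  X2  = c(43, 46, 31, 37, 50, 34),
  X3  = c(21, 18, 22, 17, 26, 21)
)

# One row per species: average the traits; Dep and R2 are constant within
# a species, so they can be used as grouping variables and carried along
sp <- aggregate(cbind(X1, X2, X3) ~ R1 + R2 + Dep, data = d, FUN = mean)
print(sp)

# With a realistic number of species, the species-level model could then be
# fitted on 'sp', e.g. lm(Dep ~ X1 + X2 + X3, data = sp).
# The toy table has only three species, so no model is fitted here.
```

Whether a genus-level term is still worth keeping on the averaged data is a separate modelling decision that this sketch does not address.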

However, note that the averaged X1, X2, X3 are then estimates from a distribution, and thus have an error. Error in the predictors can bias regression slopes. You should consider using a method that accounts for errors in variables, such as a model II regression. I would think the most convenient way to do this is a Bayesian solution, see e.g. http://mbjoseph.github.io/blog/2013/05/27/typeII/
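To see why error in a predictor matters, here is a small simulation of the bias in an ordinary least-squares slope. Everything in it is made up for illustration (sample size, true slope of 2, error variances); it is not the model II or Bayesian fit itself, only a demonstration of the problem those methods address.

```r
set.seed(1)
n <- 5000

x_true <- rnorm(n)                      # true predictor, variance 1
y      <- 2 * x_true + rnorm(n, sd = 0.5)  # true slope is 2
x_obs  <- x_true + rnorm(n)             # predictor observed with error, error variance 1

slope_true <- coef(lm(y ~ x_true))[["x_true"]]
slope_obs  <- coef(lm(y ~ x_obs))[["x_obs"]]

# Classical attenuation: the slope on the noisy predictor is pulled towards
# zero by roughly var(x_true) / (var(x_true) + var(error)), here 1 / (1 + 1)
print(c(true = slope_true, observed = slope_obs))
```

For species means, the error variance of each averaged trait is roughly the within-species variance divided by the number of specimens, so species with few specimens are affected most.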

Addition: if you desperately want to include phylogenetic information on the species level, you could use a) PGLS (e.g. http://link.springer.com/chapter/10.1007%2F978-3-662-43550-2_5), which accounts for phylogenetic signal in the residuals, or b) some mixed model where phylogenetic distance informs the covariance structure of the random effects. An example of the latter (admittedly not exactly what you want) is Ives, A. R. & Helmus, M. R. (2011) Generalized linear mixed models for phylogenetic analyses of community structure. Ecological Monographs, 81, 511-525.

Related Solutions

Solved – Level-2 predictions with lme4/glmer model

I'm not 100% sure I know what you mean by the levels: according to the usual way I've seen this terminology used, level 1 would be "above" level 2, meaning the level of the whole population, so I'm not sure how we can have a "level-1 predictor". Anyway, I'm not sure I need to know, since you can set the fixed effects however you like within newdata. I think the answer to your question is in the help:

ReForm: formula for random effects to condition on. If ‘NULL’, include all random effects; if ‘NA’ or ‘~0’, include no random effects

so ReForm=NA gives population-level predictions (i.e. predictions based on not knowing what ID is being predicted); since you have only a single random effect, using either ReForm=~ID or ReForm=NULL will give predictions conditional on specified IDs. (I see you have set allow.new.levels=TRUE; I'm not sure how that will work with predicting at the ID level ...)

With the development version of lme4:

library(lme4)
## 20 groups (f = A..T) with 30 observations each
d <- data.frame(f=factor(rep(LETTERS[1:20],each=30)))
## simulate a response from a random-intercept model
d$y <- simulate(~1+(1|f),family="gaussian",newdata=d,
    newparams=list(beta=0,theta=1,sigma=0.1),seed=102)[[1]]
m <- lmer(y~1+(1|f),data=d)
## one row per group to predict for
newdata <- data.frame(f=factor(LETTERS[1:20]))
predict(m,newdata=newdata,ReForm=NA)  ## population level: all identical
predict(m,newdata=newdata,ReForm=NULL)  ## conditional on f: different by f

(Be careful with the spelling of this argument: it has changed between versions. In released versions of lme4 it is called re.form, and ReForm survives only as a deprecated alias.)

Update: OK, you want to know the average probability of a student at school $j$ repeating a class. I think your approach is reasonable (the answer should be similar to the observed value, although in general it should be a shrinkage estimator, i.e. closer to the overall average). You might also want to consider calculating the probability that an average student at school $j$ would repeat, in which case you would first average the predictors ...
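To illustrate what "shrinkage estimator" means here, the sketch below works through the simplest case by hand: a Gaussian random-intercept model with variance components that are assumed known. The question itself is about a binomial glmer fit, where the arithmetic is not this simple, so treat this purely as an illustration of the idea; all names and numbers are made up.

```r
set.seed(42)
n_groups <- 20   # e.g. schools
n_per    <- 5    # observations per school
tau      <- 1    # assumed between-group standard deviation
sigma    <- 2    # assumed residual standard deviation

g <- rep(seq_len(n_groups), each = n_per)
u <- rnorm(n_groups, sd = tau)                    # true group effects
y <- u[g] + rnorm(n_groups * n_per, sd = sigma)   # observed responses

grand_mean <- mean(y)
group_mean <- tapply(y, g, mean)                  # raw (observed) group means

# Shrinkage weight for a balanced random-intercept model with known variances
lambda <- tau^2 / (tau^2 + sigma^2 / n_per)

# Each group's estimate is pulled from its raw mean towards the grand mean
shrunk <- grand_mean + lambda * (group_mean - grand_mean)

print(round(cbind(raw = group_mean, shrunk = shrunk), 2))
```

The smaller the groups or the noisier the observations, the smaller `lambda` becomes and the further the estimates move towards the overall average.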
