Mixed Model – Variable Order and Accounted Variability in Linear Mixed-Effects Modeling

Tags: lme4-nlme, mixed-model

Suppose that, in a study of 15 subjects, the response variable (res) is modeled with two explanatory variables: one (level) is categorical with 5 levels, and the other (response time, RT) is continuous. With lmer in the lme4 package of R, I have:

fm1 <- lmer(res ~ level * RT + (level-1 | subject), data=mydata)
anova(fm1)

             Df  Sum Sq Mean Sq  F value
level        4  3974.9   993.7   9.2181
RT           1  1953.5  1953.5  18.1209
level:RT     4  5191.4  1297.9  12.0393

If I change the order of the two variables, I get slightly different results for the main effects:

fm2 <- lmer(res ~ RT * level + (level-1 | subject), data=mydata)
anova(fm2)

             Df  Sum Sq Mean Sq  F value
RT           1  1671.8  1671.8  15.5077
level        4  4256.7  1064.2   9.8715
RT:level     4  5191.4  1297.9  12.0393

Does such a difference come from the sequential (rather than marginal) approach lme4 uses to account for data variability? In this case, changing the variable order does not lead to a big difference, but previously I've seen dramatic differences. What does such a big difference mean? Does it mean the model needs more tuning until the big difference disappears?

My second question: if I want to know which of the two variables (RT and level) accounts for more data variability, what would be a reasonable approach? Should it be based on the relative magnitude of the Sum Sq (or Mean Sq) of the two variables? Is there a statistical test for comparing the variability explained by different explanatory variables?

Best Answer

I will try to answer your questions one-by-one:

Does such a difference come from the sequential (instead of marginal) approach in lme4 in accounting for data variability?

Correct. By default, anova() reports sequential (Type I) tests: each term is tested after the terms listed before it. As you can see, only the results for the interaction are the same, because the interaction is entered last into the model in both cases. However, if you enter "level" first and then "RT", the results for "RT" tell you whether "RT" is significant after "level" is already in the model (and vice versa), so the results for the main effects are order-dependent.
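If you want tests that do not depend on the order of terms, one option is the Anova() function in the car package, which computes marginal (Type II or Type III) Wald tests for lmer fits. A hedged sketch, assuming the model from your question and that lme4 and car are installed:

```r
# Sketch: marginal (Type II) tests are invariant to the order of terms.
# 'mydata' is the data frame from the question.
library(lme4)
library(car)

fm1 <- lmer(res ~ level * RT + (level - 1 | subject), data = mydata)
fm2 <- lmer(res ~ RT * level + (level - 1 | subject), data = mydata)

Anova(fm1, type = 2)  # Type II Wald chi-square tests
Anova(fm2, type = 2)  # same results regardless of term order
```

Note that Type II vs. Type III is itself a contested choice, especially in the presence of an interaction, so this removes the order dependence but not the interpretational questions.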

What does such a big difference mean?

Suppose both variables by themselves are strongly related to the response variable, but they are also strongly correlated. In that case, there may not be a whole lot of variability in the response variable left to account for by the variable that is entered second into the model. Therefore, you will tend to see more dramatic differences when the explanatory variables are correlated.
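You can see this effect in a toy simulation. This sketch uses ordinary regression (lm) rather than a mixed model, purely to keep the example self-contained; the mechanism in the sequential ANOVA is the same:

```r
# Two correlated predictors, both related to the response.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
y  <- x1 + x2 + rnorm(n)

anova(lm(y ~ x1 + x2))  # x1 first: little variability left for x2
anova(lm(y ~ x2 + x1))  # x2 first: now x1's sequential SS shrinks
```

The more correlated x1 and x2 are, the more the sequential sums of squares for the second-entered variable shrink, and the more the two orderings disagree.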

Does it mean that the model needs more tuning until big difference disappears?

I am not sure what you mean by "tuning". The phenomenon you are observing is not a problem per se, although it does complicate the interpretation of the results (see below).

Maybe one way of "tuning" is this. If the explanatory variables are highly correlated, then they may essentially be measuring the same thing. In that case, one can "tune" the model by either removing one of the variables or combining them into a single variable.

My second question: if I want to know which of the two variables (RT and level) accounts for more data variability, what would be a reasonable approach? Should it be based on the relative magnitude of the Sum Sq (or Mean Sq) of the two variables? Is there a statistical test for comparing the variability explained by different explanatory variables?

When the explanatory variables are correlated, it is rather difficult to determine their relative importance. This issue comes up quite frequently in the multiple-regression context; dozens of articles have been written on the topic, and many methods have been suggested for accomplishing this goal. There is certainly no consensus on the most appropriate approach, and some would even argue that there is no adequate way of doing it.

The sums of squares will not help you, because they are not based on the same number of degrees of freedom. The mean squares essentially correct for that, but using the mean squares amounts to using the corresponding F-values (or p-values) to determine relative importance, which most people would not consider an appropriate approach.

Unfortunately, I do not have an easy solution. Instead, I can point you to the website of the author of the relaimpo package. I don't think the package itself will help when fitting mixed-effects models, but the site has many references to papers on the issue you are dealing with.

http://prof.beuth-hochschule.de/groemping/relaimpo/

You may also want to look into the AICcmodavg package:

http://cran.r-project.org/web/packages/AICcmodavg/index.html
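One way the AICcmodavg package could be used here is information-theoretic model comparison: fit candidate models with different fixed-effect structures and rank them by AICc. A hedged sketch, assuming the setup from the question (note that models differing in fixed effects should be fit with ML, i.e. REML = FALSE, before comparing):

```r
# Sketch: AICc-based comparison of candidate fixed-effect structures.
# 'mydata' is the data frame from the question.
library(lme4)
library(AICcmodavg)

cand <- list(
  "RT only"    = lmer(res ~ RT         + (level - 1 | subject), data = mydata, REML = FALSE),
  "level only" = lmer(res ~ level      + (level - 1 | subject), data = mydata, REML = FALSE),
  "both + int" = lmer(res ~ level * RT + (level - 1 | subject), data = mydata, REML = FALSE)
)
aictab(cand.set = cand)  # AICc table ranking the candidate models
```

This reframes "which variable matters more" as "which model is better supported by the data", which sidesteps the sequential-sums-of-squares problem but answers a slightly different question.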