I will try to answer your questions one-by-one:
Does such a difference come from the sequential (instead of
marginal) approach in lme4 in accounting for data variability?
Correct. As you can see, only for the interaction are the results the same. The interaction is entered last into the model in both cases, so the results for that term are the same. However, if you enter "level" first and then "RT", the results for "RT" tell you whether "RT" is significant after "level" is already in the model (and vice-versa). These results are order-dependent.
What does such a big difference mean?
Suppose both variables by themselves are strongly related to the response variable, but they are also strongly correlated. In that case, there may not be a whole lot of variability in the response variable left to account for by the variable that is entered second into the model. Therefore, you will tend to see more dramatic differences when the explanatory variables are correlated.
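As a quick illustration of both points, here is a minimal R sketch (the data and the correlation of 0.9 are made up) showing how sequential (Type I) sums of squares depend on entry order when the predictors are correlated:

```r
# Two strongly correlated predictors; sequential SS are order-dependent.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlated with x1
y  <- x1 + x2 + rnorm(n)

a12 <- anova(lm(y ~ x1 + x2))   # x1 entered first
a21 <- anova(lm(y ~ x2 + x1))   # x1 entered second

# The SS attributed to x1 differs substantially between the two orders:
a12["x1", "Sum Sq"]
a21["x1", "Sum Sq"]
```

Entered first, x1 is credited with all of the variability it shares with x2; entered second, it is credited only with what is left over.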
Does it mean that the model needs more tuning until big difference disappears?
I am not sure what you mean by "tuning". The phenomenon you are observing is not a problem per se, although it does complicate the interpretation of the results (see below).
Maybe one way of "tuning" is this. If the explanatory variables are highly correlated, then they may essentially be measuring the same thing. In that case, one can "tune" the model by either removing one of the variables or combining them into a single variable.
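One hedged sketch of the "combining" option: replace two highly correlated variables with their first principal component (the variable names and correlation below are made up for illustration):

```r
# Combine two highly correlated predictors into one score via PCA.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)

pc       <- prcomp(cbind(x1, x2), scale. = TRUE)
combined <- pc$x[, 1]   # single variable standing in for both

# PC1 captures most of the shared variation between x1 and x2:
summary(pc)$importance["Proportion of Variance", "PC1"]
```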
My second question is that, if I want to know which variable among the
two (RT and level) accounts for more data variability, what would be a
reasonable approach? Based on the relative magnitude of Sum Sq (or
Mean Sq) of the two variables? Any statistical testing method to
compare variability among explanatory variables?
When the explanatory variables are correlated, it is rather difficult to determine their relative importance. This issue comes up quite frequently in the multiple regression context; dozens of articles have been written on the topic, and many methods for accomplishing this goal have been suggested. There is certainly no consensus on the most appropriate approach, and some people may even argue that there is no adequate way of doing it.
The sums of squares are not going to help you, because they are not based on the same number of degrees of freedom. The mean squares essentially correct for that, but using the mean squares amounts to nothing more than using the corresponding F-values (or p-values) to determine relative importance. I think most people would not consider that an appropriate way of determining relative importance.
Unfortunately, I do not have an easy solution. Instead, I can suggest a website to you, from the author of the relaimpo package. I don't think the package will help you when fitting mixed-effects models, but there are lots of references to papers on the issue you are dealing with.
http://prof.beuth-hochschule.de/groemping/relaimpo/
You may also want to look into the AICcmodavg package:
http://cran.r-project.org/web/packages/AICcmodavg/index.html
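For instance, here is a minimal sketch of comparing candidate models by information criteria, using base R's AIC() so the example runs without extra packages (AICcmodavg's aictab() produces analogous AICc tables); the data are simulated:

```r
# Compare models by AIC; only x1 truly drives the response here.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)

m1  <- lm(y ~ x1)
m2  <- lm(y ~ x2)
m12 <- lm(y ~ x1 + x2)

AIC(m1, m2, m12)   # lower AIC indicates the better-supported model
```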
You can include twins and non-twins in a unified model by using a dummy variable and including random slopes in that dummy variable. Since all families have at most one set of twins, this will be relatively simple:
Let $A_{ij} = 1$ if sibling $j$ in family $i$ is a twin, and 0 otherwise. I'm assuming you also want the random slope to differ for twins vs. regular siblings; if not, do not include the $\eta_{i3}$ term in the model below.
Then fit the model:
$$ y_{ij} = \alpha_{0} + \alpha_{1} x_{ij} + \eta_{i0} + \eta_{i1} A_{ij}
+ \eta_{i2} x_{ij} + \eta_{i3} x_{ij} A_{ij} + \varepsilon_{ij} $$
$\alpha_{0}, \alpha_{1}$ are fixed effects, as in your specification
$\eta_{i0}$ is the 'baseline' sibling random effect and $\eta_{i1}$ is the additional random effect that allows twins to be more similar than regular siblings. The sizes of the corresponding random effect variances quantify how similar siblings are and how much more similar twins are than regular siblings. Note that both twin and non-twin correlations are characterized by this model - twin correlations are calculated by summing random effects appropriately (plug in $A_{ij}=1$).
$\eta_{i2}$ and $\eta_{i3}$ have analogous roles, only they act as the random slopes of $x_{ij}$.
$\varepsilon_{ij}$ are iid error terms - note that I have written your model slightly differently in terms of random intercepts rather than correlated residual errors.
You can fit the model using the R package lme4. In the code below, the dependent variable is y, the dummy variable is A, the predictor is x, the product of the dummy variable and the predictor is Ax, and famID is the identifier number for the family. Your data is assumed to be stored in a data frame D, with these variables as columns.
library(lme4)
g <- lmer(y ~ x + (1+A+x+Ax|famID), data=D)
The random-effects variances and the fixed-effects estimates can be viewed by typing summary(g). Note that this model allows the random effects to be freely correlated with each other.
In many cases, it may make more sense (or be more easily interpretable) to assume independence between the random effects (e.g. this assumption is often made to decompose genetic vs. environmental familial correlation), in which case you'd instead type
g <- lmer(y ~ x + (1|famID) + (A-1|famID) + (x-1|famID) +(Ax-1|famID), data=D)
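If you want to try this out before applying it to your own data, here is a hypothetical simulation that generates a data frame D with the structure assumed above (family size, the position of the twin pair, and all variance components are made-up values):

```r
# Simulate data from the random-intercept/random-slope twin model.
set.seed(4)
nfam         <- 500
sibs_per_fam <- 3                        # two twins + one regular sibling

famID <- rep(seq_len(nfam), each = sibs_per_fam)
A     <- rep(c(1, 1, 0), times = nfam)   # first two siblings are the twins
x     <- rnorm(nfam * sibs_per_fam)

eta0 <- rnorm(nfam, sd = 1.0)[famID]     # baseline family intercept
eta1 <- rnorm(nfam, sd = 0.8)[famID]     # extra twin intercept
eta2 <- rnorm(nfam, sd = 0.5)[famID]     # baseline random slope
eta3 <- rnorm(nfam, sd = 0.3)[famID]     # extra twin slope

y <- 1 + 2 * x + eta0 + eta1 * A + eta2 * x + eta3 * x * A +
     rnorm(nfam * sibs_per_fam, sd = 1)

D <- data.frame(y, x, A, Ax = A * x, famID = factor(famID))
# D can now be passed to the lmer() calls shown above.
```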
I was intrigued by the claim that a study of monozygotic/identical (MZ) and dizygotic/fraternal (DZ) twins assumes that twins are unrelated. So I took a look at the article, Collective effects of human genomic variation on microbiome function by New et al., which is fortunately open-access.
Here is the authors' description of their statistical analysis:
It sure doesn't sound like New et al. ignore that twins are related. Let's investigate further.
In a great example of reproducibility, the authors have made their code and scripts available on GitHub. So I took a look at their statistical analysis.
Here is the structure of their mixed-effects model:
Since this is a study of twins, the twinstatus variable is either "MZ" or "DZ". So for identical twins the model has a random component (1 | MZ_pair), and for fraternal twins the model has a random component (1 | DZ_pair), instead of a single random component (1 | pair). This allows the model to estimate the variance (relatedness) between MZ twins separately from the variance (relatedness) between DZ twins.
The data is (understandably) not freely available, so let's simulate a dataset. The goal is to show that the model doesn't ignore relatedness between twins. I don't bother with SampleShipmentNum which is irrelevant to how relatedness between twins is modeled.
Here dummy(twinstatus, "MZ") is an indicator variable equal to 1 if the twins are monozygotic and 0 otherwise. Similarly, dummy(twinstatus, "DZ") is an indicator variable equal to 1 if the twins are dizygotic.
So the model specifies that each pair of twins shares a random intercept (i.e., one intercept per family). This random intercept captures the relatedness between twins.
Random intercepts for 6 families with identical twins. For example, the twins in MZ family #1 both share the family-specific component 0.812.
Random intercepts for 6 families with fraternal twins. For example, the twins in DZ family #120 both share the family-specific component 1.081.
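To see why a shared per-pair intercept captures relatedness, here is a small stand-alone simulation (the variance components are made up); with equal intercept and error variances, the expected within-pair correlation is var(u) / (var(u) + var(e)) = 0.5:

```r
# A shared per-pair random intercept induces within-pair correlation.
set.seed(5)
npair <- 2000
u     <- rnorm(npair, sd = 1)       # one random intercept per twin pair
twin1 <- u + rnorm(npair, sd = 1)   # twin-specific noise
twin2 <- u + rnorm(npair, sd = 1)

cor(twin1, twin2)                   # close to the theoretical value 0.5
```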
Clearly New et al. do not assume that twins are unrelated. That would be strange.
[1] New, F.N., Baer, B.R., Clark, A.G. et al. Collective effects of human genomic variation on microbiome function. Sci Rep 12, 3839 (2022). https://doi.org/10.1038/s41598-022-07632-3