I will try to answer your questions one-by-one:
Does such a difference come from the sequential (instead of
marginal) approach in lme4 in accounting for data variability?
Correct. As you can see, only for the interaction are the results the same. The interaction is entered last into the model in both cases, so the results for that term are the same. However, if you enter "level" first and then "RT", the results for "RT" tell you whether "RT" is significant after "level" is already in the model (and vice-versa). These results are order-dependent.
What does such a big difference mean?
Suppose both variables by themselves are strongly related to the response variable, but they are also strongly correlated. In that case, there may not be a whole lot of variability in the response variable left to account for by the variable that is entered second into the model. Therefore, you will tend to see more dramatic differences when the explanatory variables are correlated.
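As a quick illustration of both points, here is a minimal R sketch (the data and the correlation of 0.9 are made up) showing how sequential (Type I) sums of squares depend on entry order when the predictors are correlated:

```r
# Two strongly correlated predictors; sequential SS are order-dependent.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)   # correlated with x1
y  <- x1 + x2 + rnorm(n)

a12 <- anova(lm(y ~ x1 + x2))   # x1 entered first
a21 <- anova(lm(y ~ x2 + x1))   # x1 entered second

# The SS attributed to x1 differs substantially between the two orders:
a12["x1", "Sum Sq"]
a21["x1", "Sum Sq"]
```

Entered first, x1 is credited with all of the variability it shares with x2; entered second, it is credited only with what is left over.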
Does it mean that the model needs more tuning until big difference disappears?
I am not sure what you mean by "tuning". The phenomenon you are observing is not a problem per se, although it does complicate the interpretation of the results (see below).
Maybe one way of "tuning" is this. If the explanatory variables are highly correlated, then they may essentially be measuring the same thing. In that case, one can "tune" the model by either removing one of the variables or combining them into a single variable.
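One hedged sketch of the "combining" option: replace two highly correlated variables with their first principal component (the variable names and correlation below are made up for illustration):

```r
# Combine two highly correlated predictors into one score via PCA.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)

pc       <- prcomp(cbind(x1, x2), scale. = TRUE)
combined <- pc$x[, 1]   # single variable standing in for both

# PC1 captures most of the shared variation between x1 and x2:
summary(pc)$importance["Proportion of Variance", "PC1"]
```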
My second question is that, if I want to know which variable among the
two (RT and level) accounts for more data variability, what would be a
reasonable approach? Based on the relative magnitude of Sum Sq (or
Mean Sq) of the two variables? Any statistical testing method to
compare variability among explanatory variables?
When the explanatory variables are correlated, it is rather difficult to determine their relative importance. This issue comes up quite frequently in the multiple regression context; dozens of articles have been written on the topic, and many methods for accomplishing this goal have been suggested. There is certainly no consensus on the most appropriate approach, and some people may even argue that there is no adequate way of doing it.
The sums of squares are not going to help you, because they are not based on the same number of degrees of freedom. The mean squares essentially correct for that, but using the mean squares amounts to nothing more than using the corresponding F-values (or p-values) to determine relative importance. I think most people would not consider that an appropriate way of determining relative importance.
Unfortunately, I do not have an easy solution. Instead, I can suggest a website to you, from the author of the relaimpo package. I don't think the package will help you when fitting mixed-effects models, but there are lots of references to papers on the issue you are dealing with.
http://prof.beuth-hochschule.de/groemping/relaimpo/
You may also want to look into the AICcmodavg package:
http://cran.r-project.org/web/packages/AICcmodavg/index.html
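For instance, here is a minimal sketch of comparing candidate models by information criteria, using base R's AIC() so the example runs without extra packages (AICcmodavg's aictab() produces analogous AICc tables); the data are simulated:

```r
# Compare models by AIC; only x1 truly drives the response here.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 * x1 + rnorm(n)

m1  <- lm(y ~ x1)
m2  <- lm(y ~ x2)
m12 <- lm(y ~ x1 + x2)

AIC(m1, m2, m12)   # lower AIC indicates the better-supported model
```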
You can include twins and non-twins in a unified model by using a dummy variable and including random slopes in that dummy variable. Since all families have at most one set of twins, this will be relatively simple:
Let $A_{ij} = 1$ if sibling $j$ in family $i$ is a twin, and 0 otherwise. I'm assuming you also want the random slope to differ for twins vs. regular siblings; if not, do not include the $\eta_{i3}$ term in the model below.
Then fit the model:
$$ y_{ij} = \alpha_{0} + \alpha_{1} x_{ij} + \eta_{i0} + \eta_{i1} A_{ij}
+ \eta_{i2} x_{ij} + \eta_{i3} x_{ij} A_{ij} + \varepsilon_{ij} $$
$\alpha_{0}, \alpha_{1}$ are fixed effects, as in your specification
$\eta_{i0}$ is the 'baseline' sibling random effect and $\eta_{i1}$ is the additional random effect that allows twins to be more similar than regular siblings. The sizes of the corresponding random effect variances quantify how similar siblings are and how much more similar twins are than regular siblings. Note that both twin and non-twin correlations are characterized by this model - twin correlations are calculated by summing random effects appropriately (plug in $A_{ij}=1$).
$\eta_{i2}$ and $\eta_{i3}$ have analogous roles, only they act as the random slopes of $x_{ij}$.
$\varepsilon_{ij}$ are iid error terms - note that I have written your model slightly differently in terms of random intercepts rather than correlated residual errors.
You can fit the model using the R package lme4. In the code below, the dependent variable is y, the dummy variable is A, the predictor is x, the product of the dummy variable and the predictor is Ax, and famID is the identifier number for the family. Your data is assumed to be stored in a data frame D, with these variables as columns.
library(lme4)
g <- lmer(y ~ x + (1+A+x+Ax|famID), data=D)
The random-effects variances and the fixed-effects estimates can be viewed by typing summary(g). Note that this model allows the random effects to be freely correlated with each other.
In many cases, it may make more sense (or be more easily interpretable) to assume independence between the random effects (e.g. this assumption is often made to decompose genetic vs. environmental familial correlation), in which case you'd instead type
g <- lmer(y ~ x + (1|famID) + (A-1|famID) + (x-1|famID) +(Ax-1|famID), data=D)
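If you want to try this out before applying it to your own data, here is a hypothetical simulation that generates a data frame D with the structure assumed above (family size, the position of the twin pair, and all variance components are made-up values):

```r
# Simulate data from the random-intercept/random-slope twin model.
set.seed(4)
nfam         <- 500
sibs_per_fam <- 3                        # two twins + one regular sibling

famID <- rep(seq_len(nfam), each = sibs_per_fam)
A     <- rep(c(1, 1, 0), times = nfam)   # first two siblings are the twins
x     <- rnorm(nfam * sibs_per_fam)

eta0 <- rnorm(nfam, sd = 1.0)[famID]     # baseline family intercept
eta1 <- rnorm(nfam, sd = 0.8)[famID]     # extra twin intercept
eta2 <- rnorm(nfam, sd = 0.5)[famID]     # baseline random slope
eta3 <- rnorm(nfam, sd = 0.3)[famID]     # extra twin slope

y <- 1 + 2 * x + eta0 + eta1 * A + eta2 * x + eta3 * x * A +
     rnorm(nfam * sibs_per_fam, sd = 1)

D <- data.frame(y, x, A, Ax = A * x, famID = factor(famID))
# D can now be passed to the lmer() calls shown above.
```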
I was intrigued by the claim that a study of monozygotic/identical (MZ) and dizygotic/fraternal (DZ) twins assumes that twins are unrelated. So I took a look at the article, Collective effects of human genomic variation on microbiome function by New et al., which is fortunately open-access.
Here is the authors' description of their statistical analysis:
It sure doesn't sound like New et al. ignore that twins are related. Let's investigate further.
In a great example of reproducibility, the authors have made their code and scripts available on GitHub. So I took a look at their statistical analysis.
Here is the structure of their mixed-effects model:
Since this is a study of twins, the twinstatus variable is either "MZ" or "DZ". So for identical twins the model has a random component (1 | MZ_pair), and for fraternal twins the model has a random component (1 | DZ_pair), instead of a single random component (1 | pair). This allows the model to estimate the variance (relatedness) between MZ twins separately from the variance (relatedness) between DZ twins.
The data is (understandably) not freely available, so let's simulate a dataset. The goal is to show that the model doesn't ignore relatedness between twins. I don't bother with SampleShipmentNum which is irrelevant to how relatedness between twins is modeled.
Here dummy(twinstatus, "MZ") is an indicator variable equal to 1 if the twins are monozygotic and 0 otherwise. Similarly, dummy(twinstatus, "DZ") is an indicator variable equal to 1 if the twins are dizygotic.
So the model specifies that each pair of twins shares a random intercept (i.e., one intercept per family). This random intercept captures the relatedness between twins.
Random intercepts for 6 families with identical twins. For example, the twins in MZ family #1 both share the family-specific component 0.812.
Random intercepts for 6 families with fraternal twins. For example, the twins in DZ family #120 both share the family-specific component 1.081.
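To see why a shared per-pair intercept captures relatedness, here is a small stand-alone simulation (the variance components are made up); with equal intercept and error variances, the expected within-pair correlation is var(u) / (var(u) + var(e)) = 0.5:

```r
# A shared per-pair random intercept induces within-pair correlation.
set.seed(5)
npair <- 2000
u     <- rnorm(npair, sd = 1)       # one random intercept per twin pair
twin1 <- u + rnorm(npair, sd = 1)   # twin-specific noise
twin2 <- u + rnorm(npair, sd = 1)

cor(twin1, twin2)                   # close to the theoretical value 0.5
```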
Clearly New et al. do not assume that twins are unrelated. That would be strange.
[1] New, F.N., Baer, B.R., Clark, A.G. et al. Collective effects of human genomic variation on microbiome function. Sci Rep 12, 3839 (2022). https://doi.org/10.1038/s41598-022-07632-3