Mixed Model – Linear Mixed Effects Modeling with MZ and DZ Twin Pairs

lme4-nlmemixed modelrandom-effects-model

I have a question on how to deal with following experimental setup. I have data on microbiome and metabolomics for twins that are either MZ or DZ, and thus we have only pairs in the dataset. I followed a similar setup from https://www.nature.com/articles/s41598-022-07632-3 , where they assume the twins are unrelated, and they state the following: "We include a random effect for the twin’s zygosity (monozygotic or dizygotic) to remove the effect of a person’s correlation with their own twin. This step allows the method to be generalizable to other cohorts since we are treating the individuals as if they were unrelated."

Suppose I want to model each of the microbe abundances (303 columns, "g") (that have been previously regressed against confounders) , in a linear mixed model to find differential microbes across the group of interest (here diagnosis "MD" (mental disorder), a categorical variable of 7 classes), I do the following:

lme <- lme(g ~ relevel(factor(MD), ref="HC"), random = list(ind= ~(0+dummy(Zygosity,"DZ")), ind=~ (0+dummy(Zygosity,"MZ"))) , data=microb.conf)

assuming a random slope for each individual; "ind" is the individual within the twin pair (being I1 or I2), but I wondered why? In the article, they also include pcs from genome data, but I do not have these data.
The problem is that for some of the genera I get this error message that the system is "computationally singular". I read a lot on this, to specify the covariance structure, but I am a bit lost on how to deal with this in this specific setup, because I'm not sure if I understand this model.

Any help would be greatly appreciated, thanks a lot!

Best Answer

Comments:

  • To understand the statement "This step allows the method to be generalizable to other cohorts since we are treating the individuals as if they were unrelated.", consider Figure 1 in the paper. New et al. describe a three-step procedure called sCCA. There is a preprocessing step that applies a GLMM to the twin data. The GLMM properly models the relatedness, so that the residuals from the GLMM are adjusted for the correlation between twins. Below I show how this preprocessing step is implemented.
  • You asked a very similar question about the linear mixed model(s) for microbiome gene/species abundances by New et al. here.

I was intrigued by the claim that a study of monozygotic/identical (MZ) and dizygotic/fraternal (DZ) twins assumes that twins are unrelated. So I took a look at the article, Collective effects of human genomic variation on microbiome function by New et al., which is fortunately open-access.

Here is the authors' description of their statistical analysis:

Prior to association testing, a Tweedie generalized linear mixed model was performed in R using the lme4 (v1.1.21) and statmod (v1.4.35) packages to extract residuals from the gene, species, and SNP data. The models include fixed terms for the individual’s age at sampling, their BMI, the shipment number of their sample, and a random effect for their twin status (either monozygotic or dizygotic) to remove the effect of a person’s correlation with their own twin otherwise known as relatedness.

It sure doesn't sound like New et al. ignore that twins are related. Let's investigate further.

In a great example of reproducibility, the authors have made their code and scripts available on GitHub. So I took a look at their statistical analysis.

Here is the structure of their mixed-effects model:

mod <- lmer(
  y ~ 1
    + Age.at.metagenomics.sample
    + IndividualBMI
    + (1 | SampleShipmentNum)
    + (0 + dummy(twinstatus, "MZ") | IndividualFamilyID)
    + (0 + dummy(twinstatus, "DZ") | IndividualFamilyID),
  data = meta.s
)

Since this is a study of twins, the twinstatus variable is either "MZ" or "DZ". So for identical twins the model has a random component (1 | MZ_pair) and for fraternal twins the model has a random component (1 | DZ_pair) instead of a single random ctomponent (1 | pair). This allows the model to estimate variance (relatedness) between MZ twins separately from the variance (relatedness) between DZ twins.

The data is (understandably) not freely available, so let's simulate a dataset. The goal is to show that the model doesn't ignore relatedness between twins. I don't bother with SampleShipmentNum which is irrelevant to how relatedness between twins is modeled.

library("lme4")
library("tidyverse")

set.seed(1234)

nMZ <- 35
nDZ <- 90

# Generate fake dataset
meta.s <-
  tibble(
    twinstatus = rep(c("MZ", "DZ"), times = c(nMZ, nDZ)),
    IndividualFamilyID = seq(nMZ + nDZ),
    Age.at.metagenomics.sample = sample(20:40, size = nMZ + nDZ, replace = TRUE),
    twin1 = 1L,
    twin2 = 2L,
    relatedness = rnorm(nMZ + nDZ)
  ) %>%
  pivot_longer(
    c(twin1, twin2)
  ) %>%
  mutate(
    IndividualBMI = rnorm(n()),
    y =
      relatedness + Age.at.metagenomics.sample + IndividualBMI
        + rnorm(n())
  )
# Fit LMM
mod <- lmer(
  y ~ 1
    + Age.at.metagenomics.sample
    + IndividualBMI
    + (0 + dummy(twinstatus, "MZ") | IndividualFamilyID)
    + (0 + dummy(twinstatus, "DZ") | IndividualFamilyID),
  data = meta.s
)

Here dummy(twinstatus, "MZ") is an indicator variable equal to 1 if the twins are monozygotic and 0 otherwise. Similary, dummy(twinstatus, "DZ") is an indicator variable equal to 1 if the twins are dizygotic.

So the model specifies that each pair of twins share a random intercept (ie, one intercept per family). This random intercept captures relatedness between twins.

summary(mod)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: 
#> y ~ 1 + Age.at.metagenomics.sample + IndividualBMI + (0 + dummy(twinstatus,  
#>     "MZ") | IndividualFamilyID) + (0 + dummy(twinstatus, "DZ") |  
#>     IndividualFamilyID)
#>    Data: meta.s
#> 
#> REML criterion at convergence: 847.8
#> 
#> Scaled residuals: 
#>      Min       1Q   Median       3Q      Max 
#> -2.06827 -0.46016  0.03908  0.54353  2.06268 
#> 
#> Random effects:
#>  Groups               Name                    Variance Std.Dev.
#>  IndividualFamilyID   dummy(twinstatus, "MZ") 2.0493   1.4315  
#>  IndividualFamilyID.1 dummy(twinstatus, "DZ") 0.8777   0.9369  
#>  Residual                                     0.8889   0.9428  
#> Number of obs: 250, groups:  IndividualFamilyID, 125
#> 
#> Fixed effects:
#>                            Estimate Std. Error t value
#> (Intercept)                 0.11482    0.55299   0.208
#> Age.at.metagenomics.sample  0.99824    0.01789  55.793
#> IndividualBMI               1.04561    0.07044  14.845
#> 
#> Correlation of Fixed Effects:
#>             (Intr) Ag.t..
#> Ag.t.mtgnm. -0.980       
#> IndividlBMI -0.004 -0.001

ranef_by_FamilyID <- ranef(mod)$IndividualFamilyID

Random intercepts for 6 families with identical twins. For example, the twins in MZ family #1 both share the family-specific component 0.812.

head(ranef_by_FamilyID)
#>   dummy(twinstatus, "MZ") dummy(twinstatus, "DZ")
#> 1               0.8121543                       0
#> 2              -0.3581110                       0
#> 3               0.7180179                       0
#> 4               0.7349613                       0
#> 5              -1.1501409                       0
#> 6               1.7900718                       0

Random intercepts for 6 families with fraternal twins. For example, the twins in DZ family #120 both share the family-specific component 1.081.

tail(ranef_by_FamilyID)
#>     dummy(twinstatus, "MZ") dummy(twinstatus, "DZ")
#> 120                       0              1.08155334
#> 121                       0              0.41079703
#> 122                       0              0.14490491
#> 123                       0             -0.04945822
#> 124                       0              0.41990437
#> 125                       0             -0.11631196

Clearly New et al. do not assume that twins are unrelated. That would be strange.

[1] New, F.N., Baer, B.R., Clark, A.G. et al. Collective effects of human genomic variation on microbiome function. Sci Rep 12, 3839 (2022). https://doi.org/10.1038/s41598-022-07632-3

Related Question