Mixed-Effects Modeling – Linear Mixed-Effects Modeling with Twin Study Data

covariance-matrix, lme4-nlme, mixed-model, non-independent

Suppose I have some response variable $y_{ij}$ that was measured from the $j$th sibling in the $i$th family. In addition, some behavioral data $x_{ij}$ were collected at the same time from each subject. I'm trying to analyze the situation with the following linear mixed-effects model:

$$y_{ij} = \alpha_0 + \alpha_1 x_{ij} + \delta_{1i} x_{ij} + \varepsilon_{ij}$$

where $\alpha_0$ and $\alpha_1$ are the fixed intercept and slope respectively, $\delta_{1i}$ is the random slope, and $\varepsilon_{ij}$ is the residual.

The assumptions for the random effects $\delta_{1i}$ and residual $\varepsilon_{ij}$ are (assuming there are only two siblings within each family)

\begin{align}
\delta_{1i} &\stackrel{d}{\sim} N(0, \tau^2) \\[5pt]
(\varepsilon_{i1}, \varepsilon_{i2})^T &\stackrel{d}{\sim} N((0, 0)^T, R)
\end{align}

where $\tau^2$ is an unknown variance parameter and the variance-covariance structure $R$ is a $2 \times 2$ symmetric matrix of the form

$$\begin{pmatrix}
r_1^2&r_{12}^2\\
r_{12}^2&r_2^2
\end{pmatrix}$$

that models the correlation between the two siblings.
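
In case it helps, under these assumptions the implied marginal moments (treating $x_{ij}$ as fixed, and taking $\delta_{1i}$ independent of the residuals) are

\begin{align}
\operatorname{Var}(y_{ij}) &= \tau^2 x_{ij}^2 + r_j^2 \\[5pt]
\operatorname{Cov}(y_{i1}, y_{i2}) &= \tau^2 x_{i1} x_{i2} + r_{12}^2
\end{align}

so the within-family dependence comes both from the shared random slope and from $r_{12}^2$.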

  1. Is this an appropriate model for such a sibling study?

  2. The data are a little bit complicated. Among the 50 families, close to 90% of them are dizygotic (DZ) twin pairs. Of the remaining families,

    1. two have only one sibling;
    2. two have one DZ pair plus one sibling; and
    3. two have one DZ pair plus two additional siblings.

    I believe lme in the R package nlme can easily handle case (1), since it copes with missing or unbalanced data. My trouble is how to deal with (2) and (3). One possibility I can think of is to break each of those four families in (2) and (3) into two subfamilies, each containing one or two siblings, so that the above model could still be applied. Is this fine? Another option would be to simply throw away the data from the extra one or two siblings in (2) and (3), which seems wasteful. Are there better approaches?

  3. It seems that lme allows one to fix the $r$ values in the residual variance-covariance matrix $R$, for example $r_{12}^2 = 0.5$. Does it make sense to impose the correlation structure, or should I simply estimate it from the data?
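
For reference, the syntax I have in mind for question 3 is something like the following (an untested sketch; D, y, x and famID are my assumed data frame and column names, and corCompSymm additionally forces $r_1^2 = r_2^2$, so it is more restrictive than the general $R$ above):

```r
library(nlme)

# Sketch: fix the within-family residual correlation at 0.5 instead of
# estimating it, via the `fixed` argument of a corStruct object.
m <- lme(y ~ x,
         random = ~ x - 1 | famID,  # random slope only, matching the model above
         correlation = corCompSymm(value = 0.5, form = ~ 1 | famID, fixed = TRUE),
         data = D)
```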

Best Answer

You can include twins and non-twins in a unified model by using a dummy variable and including random slopes on that dummy variable. Since all families have at most one set of twins, this will be relatively simple:

Let $A_{ij} = 1$ if sibling $j$ in family $i$ is a twin, and $A_{ij} = 0$ otherwise. I'm assuming you also want the random slope to differ for twins vs. regular siblings; if not, do not include the $\eta_{i3}$ term in the model below.

Then fit the model:

$$ y_{ij} = \alpha_{0} + \alpha_{1} x_{ij} + \eta_{i0} + \eta_{i1} A_{ij} + \eta_{i2} x_{ij} + \eta_{i3} x_{ij} A_{ij} + \varepsilon_{ij} $$

  • $\alpha_{0}, \alpha_{1}$ are the fixed effects, as in your specification

  • $\eta_{i0}$ is the 'baseline' sibling random effect and $\eta_{i1}$ is the additional random effect that allows twins to be more similar than regular siblings. The sizes of the corresponding random effect variances quantify how similar siblings are and how much more similar twins are than regular siblings. Note that both twin and non-twin correlations are characterized by this model - twin correlations are calculated by summing random effects appropriately (plug in $A_{ij}=1$).

  • $\eta_{i2}$ and $\eta_{i3}$ have analogous roles, only they act as the random slopes of $x_{ij}$.

  • $\varepsilon_{ij}$ are iid error terms - note that I have written your model slightly differently in terms of random intercepts rather than correlated residual errors.
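
To make the implied correlation structure explicit, write $\sigma_0^2 = \operatorname{Var}(\eta_{i0})$, $\sigma_1^2 = \operatorname{Var}(\eta_{i1})$ and $\sigma_\varepsilon^2 = \operatorname{Var}(\varepsilon_{ij})$. Evaluating at $x_{ij} = 0$ and taking the random effects independent, the intraclass correlations are

\begin{align}
\operatorname{corr}(y_{i1}, y_{i2}) &= \frac{\sigma_0^2}{\sigma_0^2 + \sigma_\varepsilon^2} && \text{(non-twin pair)} \\[5pt]
\operatorname{corr}(y_{i1}, y_{i2}) &= \frac{\sigma_0^2 + \sigma_1^2}{\sigma_0^2 + \sigma_1^2 + \sigma_\varepsilon^2} && \text{(twin pair)}
\end{align}

With correlated random effects, the corresponding covariance terms are added to numerator and denominator.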

You can fit the model using the R package lme4. In the code below the dependent variable is y, the dummy variable is A, the predictor is x, the product of the dummy variable and the predictor is Ax, and famID is the family identifier. Your data are assumed to be stored in a data frame D, with these variables as columns.

library(lme4)
D$Ax <- D$A * D$x  # construct the product column if it is not already present
g <- lmer(y ~ x + (1 + A + x + Ax | famID), data = D)

The random-effect variance estimates and the fixed-effect estimates can be viewed by typing summary(g). Note that this model allows the random effects to be freely correlated with each other.

In many cases, it may make more sense (or be more easily interpretable) to assume independence between the random effects (e.g. this assumption is often made to decompose genetic vs. environmental familial correlation), in which case you'd instead type

g <- lmer(y ~ x + (1 | famID) + (A - 1 | famID) + (x - 1 | famID) + (Ax - 1 | famID), data = D)
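
The fitted variance components can then be turned into the implied intercept-level correlations. A sketch, assuming the independent-random-effects fit g from the line above (VarCorr is from lme4):

```r
# Sketch: implied sibling and twin correlations at x = 0 from the fit g.
vc <- as.data.frame(VarCorr(g))              # one row per variance component
s2_0 <- vc$vcov[vc$var1 %in% "(Intercept)"]  # baseline family variance
s2_A <- vc$vcov[vc$var1 %in% "A"]            # extra variance shared by twins
s2_e <- vc$vcov[vc$grp == "Residual"]        # residual variance

sib_corr  <- s2_0 / (s2_0 + s2_e)                  # non-twin siblings
twin_corr <- (s2_0 + s2_A) / (s2_0 + s2_A + s2_e)  # twins
```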