Mixed-Effects Modeling – Linear Mixed-Effects Modeling with Twin Study Data

covariance-matrix, lme4-nlme, mixed-model, non-independent

Suppose I have some response variable $y_{ij}$ that was measured from the $j$th sibling in the $i$th family. In addition, some behavioral data $x_{ij}$ were collected at the same time from each subject. I'm trying to analyze the situation with the following linear mixed-effects model:

$$y_{ij} = \alpha_0 + \alpha_1 x_{ij} + \delta_{1i} x_{ij} + \varepsilon_{ij}$$

where $\alpha_0$ and $\alpha_1$ are the fixed intercept and slope respectively, $\delta_{1i}$ is the random slope, and $\varepsilon_{ij}$ is the residual.

The assumptions for the random effects $\delta_{1i}$ and residual $\varepsilon_{ij}$ are (assuming there are only two siblings within each family)

\begin{align}
\delta_{1i} &\stackrel{d}{\sim} N(0, \tau^2) \\[5pt]
(\varepsilon_{i1}, \varepsilon_{i2})^T &\stackrel{d}{\sim} N((0, 0)^T, R)
\end{align}

where $\tau^2$ is an unknown variance parameter and the variance-covariance structure $R$ is a $2 \times 2$ symmetric matrix of the form

$$\begin{pmatrix}
r_1^2&r_{12}^2\\
r_{12}^2&r_2^2
\end{pmatrix}$$

that models the correlation between the two siblings.
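
In case it helps, under these assumptions the implied marginal moments (treating $x_{ij}$ as fixed, and taking $\delta_{1i}$ independent of the residuals) are

\begin{align}
\operatorname{Var}(y_{ij}) &= \tau^2 x_{ij}^2 + r_j^2 \\[5pt]
\operatorname{Cov}(y_{i1}, y_{i2}) &= \tau^2 x_{i1} x_{i2} + r_{12}^2
\end{align}

so the within-family dependence comes both from the shared random slope and from $r_{12}^2$.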

  1. Is this an appropriate model for such a sibling study?

  2. The data are a little bit complicated. Among the 50 families, close to 90% of them are dizygotic (DZ) twin pairs. Of the remaining families,

    1. two have only one sibling;
    2. two have one DZ pair plus one sibling; and
    3. two have one DZ pair plus two additional siblings.

    I believe lme in the R package nlme can easily handle case (1), since it copes with missing or unbalanced data. My trouble is how to deal with (2) and (3). One possibility I can think of is to break each of those four families in (2) and (3) into two subfamilies, each containing one or two siblings, so that the above model could still be applied. Is this fine? Another option would be to simply throw away the data from the extra one or two siblings in (2) and (3), which seems wasteful. Are there better approaches?

  3. It seems that lme allows one to fix the $r$ values in the residual variance-covariance matrix $R$, for example $r_{12}^2 = 0.5$. Does it make sense to impose the correlation structure, or should I simply estimate it from the data?
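
For reference, the syntax I have in mind for question 3 is something like the following (an untested sketch; D, y, x and famID are my assumed data frame and column names, and corCompSymm additionally forces $r_1^2 = r_2^2$, so it is more restrictive than the general $R$ above):

```r
library(nlme)

# Sketch: fix the within-family residual correlation at 0.5 instead of
# estimating it, via the `fixed` argument of a corStruct object.
m <- lme(y ~ x,
         random = ~ x - 1 | famID,  # random slope only, matching the model above
         correlation = corCompSymm(value = 0.5, form = ~ 1 | famID, fixed = TRUE),
         data = D)
```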

Best Answer

You can include twins and non-twins in a unified model by using a dummy variable and including random slopes on that dummy variable. Since all families have at most one set of twins, this will be relatively simple:

Let $A_{ij} = 1$ if sibling $j$ in family $i$ is a twin, and $A_{ij} = 0$ otherwise. I'm assuming you also want the random slope to differ for twins vs. regular siblings; if not, do not include the $\eta_{i3}$ term in the model below.

Then fit the model:

$$ y_{ij} = \alpha_{0} + \alpha_{1} x_{ij} + \eta_{i0} + \eta_{i1} A_{ij} + \eta_{i2} x_{ij} + \eta_{i3} x_{ij} A_{ij} + \varepsilon_{ij} $$

  • $\alpha_{0}, \alpha_{1}$ are the fixed effects, as in your specification

  • $\eta_{i0}$ is the 'baseline' sibling random effect and $\eta_{i1}$ is the additional random effect that allows twins to be more similar than regular siblings. The sizes of the corresponding random effect variances quantify how similar siblings are and how much more similar twins are than regular siblings. Note that both twin and non-twin correlations are characterized by this model - twin correlations are calculated by summing random effects appropriately (plug in $A_{ij}=1$).

  • $\eta_{i2}$ and $\eta_{i3}$ have analogous roles, only they act as the random slopes of $x_{ij}$.

  • $\varepsilon_{ij}$ are iid error terms - note that I have written your model slightly differently in terms of random intercepts rather than correlated residual errors.
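
To make the implied correlation structure explicit, write $\sigma_0^2 = \operatorname{Var}(\eta_{i0})$, $\sigma_1^2 = \operatorname{Var}(\eta_{i1})$ and $\sigma_\varepsilon^2 = \operatorname{Var}(\varepsilon_{ij})$. Evaluating at $x_{ij} = 0$ and taking the random effects independent, the intraclass correlations are

\begin{align}
\operatorname{corr}(y_{i1}, y_{i2}) &= \frac{\sigma_0^2}{\sigma_0^2 + \sigma_\varepsilon^2} && \text{(non-twin pair)} \\[5pt]
\operatorname{corr}(y_{i1}, y_{i2}) &= \frac{\sigma_0^2 + \sigma_1^2}{\sigma_0^2 + \sigma_1^2 + \sigma_\varepsilon^2} && \text{(twin pair)}
\end{align}

With correlated random effects, the corresponding covariance terms are added to numerator and denominator.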

You can fit the model using the R package lme4. In the code below the dependent variable is y, the dummy variable is A, the predictor is x, the product of the dummy variable and the predictor is Ax, and famID is the family identifier. Your data are assumed to be stored in a data frame D, with these variables as columns.

library(lme4)
D$Ax <- D$A * D$x  # construct the product column if it is not already present
g <- lmer(y ~ x + (1 + A + x + Ax | famID), data = D)

The random-effect variance estimates and the fixed-effect estimates can be viewed by typing summary(g). Note that this model allows the random effects to be freely correlated with each other.

In many cases, it may make more sense (or be more easily interpretable) to assume independence between the random effects (e.g. this assumption is often made to decompose genetic vs. environmental familial correlation), in which case you'd instead type

g <- lmer(y ~ x + (1 | famID) + (A - 1 | famID) + (x - 1 | famID) + (Ax - 1 | famID), data = D)
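
The fitted variance components can then be turned into the implied intercept-level correlations. A sketch, assuming the independent-random-effects fit g from the line above (VarCorr is from lme4):

```r
# Sketch: implied sibling and twin correlations at x = 0 from the fit g.
vc <- as.data.frame(VarCorr(g))              # one row per variance component
s2_0 <- vc$vcov[vc$var1 %in% "(Intercept)"]  # baseline family variance
s2_A <- vc$vcov[vc$var1 %in% "A"]            # extra variance shared by twins
s2_e <- vc$vcov[vc$grp == "Residual"]        # residual variance

sib_corr  <- s2_0 / (s2_0 + s2_e)                  # non-twin siblings
twin_corr <- (s2_0 + s2_A) / (s2_0 + s2_A + s2_e)  # twins
```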