Solved – Intraclass Correlation Coefficients (ICC) with Multiple Variables

Tags: intraclass-correlation, mixed-model

Suppose I have measured some variable in siblings, which are nested within families. The data structure looks like this:

family sibling value
------ ------- -----
1      1       y_11
1      2       y_12
2      1       y_21
2      2       y_22
2      3       y_23
...    ...     ...

I want to know the correlation between measurements taken on siblings within the same family. The usual way of doing that is to calculate the ICC based on a random-intercept model:

library(nlme)
res <- lme(yij ~ 1, random = ~ 1 | family, data=dat)
getVarCov(res)[[1]] / (getVarCov(res)[[1]] + res$sigma^2)

This would be equivalent to:

res <- gls(yij ~ 1, correlation = corCompSymm(form = ~ 1 | family), data=dat)

except that the latter approach also allows for a negative ICC.
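For completeness, that compound-symmetry correlation (the ICC) can be read directly off the gls fit via the corCompSymm parameter. A minimal sketch on simulated data (the family-effect and residual variances below are made up for illustration):

```r
library(nlme)

set.seed(42)
# toy data: 50 families with 2-3 siblings each, sharing a family effect
fam <- rep(1:50, times = sample(2:3, 50, replace = TRUE))
dat <- data.frame(family = fam,
                  yij = rnorm(50)[fam] + rnorm(length(fam)))

res <- gls(yij ~ 1, correlation = corCompSymm(form = ~ 1 | family), data = dat)

# the ICC is the single correlation parameter of the corCompSymm structure
icc <- as.numeric(coef(res$modelStruct$corStruct, unconstrained = FALSE))
icc
```

Unlike the variance-components ratio from lme, this estimate is free to go negative.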

Now suppose I have measured three items in siblings nested within families. So, the data structure looks like this:

family sibling item value
------ ------- ---- -----
1      1       1    y_111
1      1       2    y_112
1      1       3    y_113
1      2       1    y_121
1      2       2    y_122
1      2       3    y_123
2      1       1    y_211
2      1       2    y_212
2      1       3    y_213
2      2       1    y_221
2      2       2    y_222
2      2       3    y_223
2      3       1    y_231
2      3       2    y_232
2      3       3    y_233
...    ...     ...  ...

Now, I want to find out about:

  1. the correlation between measurements taken on siblings within the same family for the same item
  2. the correlation between measurements taken on siblings within the same family for different items

If I only had pairs of siblings within families, I would just do:

res <- gls(yijk ~ item, correlation = corSymm(form = ~ 1 | family), 
           weights = varIdent(form = ~ 1 | item), data=dat)

which gives me a $6 \times 6$ var-cov matrix on the residuals of the form:

$\left[\begin{array}{ccc|ccc}
\sigma^2_1 & \rho_{12} \sigma_1 \sigma_2 & \rho_{13} \sigma_1 \sigma_3 & \phi_{11} \sigma^2_1 & \phi_{12} \sigma_1 \sigma_2 & \phi_{13} \sigma_1 \sigma_3 \\
& \sigma^2_2 & \rho_{23} \sigma_2 \sigma_3 & & \phi_{22} \sigma^2_2 & \phi_{23} \sigma_2 \sigma_3 \\
& & \sigma^2_3 & & & \phi_{33} \sigma^2_3 \\ \hline
& & & \sigma^2_1 & \rho_{12} \sigma_1 \sigma_2 & \rho_{13} \sigma_1 \sigma_3 \\
& & & & \sigma^2_2 & \rho_{23} \sigma_2 \sigma_3 \\
& & & & & \sigma^2_3 \\
\end{array}\right]$

based on which I could easily estimate those cross-sibling correlations (the $\phi_{jj}$ values are the ICCs for the same item; the $\phi_{jj'}$ values are the ICCs for different items). However, as shown above, for some families I have only two siblings, but for other families more than two. So, that makes me think that I need to get back to a variance-components type of model. However, the correlation between items may be negative, so I do not want to use a model that constrains the correlations to be positive.
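To make the mapping between the block matrix and the correlations concrete, here is a small numeric sketch (all parameter values made up) that builds the $6 \times 6$ matrix for one two-sibling family and recovers the $\phi$ values with cov2cor:

```r
# made-up parameter values for illustration
s <- c(1.0, 1.5, 2.0)                                    # sigma_1..sigma_3
R <- matrix(c(1, .3, .2, .3, 1, .4, .2, .4, 1), 3)       # within-sibling rho_jj'
P <- matrix(c(.5, .1, .1, .1, .4, .2, .1, .2, .3), 3)    # cross-sibling phi_jj'

D <- diag(s)
V <- rbind(cbind(D %*% R %*% D, D %*% P %*% D),          # full 6x6 matrix for
           cbind(D %*% t(P) %*% D, D %*% R %*% D))       # a two-sibling family

# the off-diagonal block of the correlation matrix returns phi_jj and phi_jj'
cov2cor(V)[1:3, 4:6]
```

The diagonal of that off-diagonal block gives the same-item ICCs and the off-diagonal entries the different-item ICCs, which is exactly what corSymm/varIdent estimate when every family has two siblings.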

Any ideas/suggestions of how I could approach this? Thanks in advance for any help!

Best Answer

The package MCMCglmm can easily handle and estimate covariance structures and random effects. However, it uses Bayesian statistics, which can be intimidating to new users. See the MCMCglmm Course Notes for a thorough guide to MCMCglmm, and chapter 5 in particular for this question. I strongly recommend reading up on assessing model convergence and chain mixing before analysing data for real with MCMCglmm.

library(MCMCglmm)

MCMCglmm uses priors; this is an uninformative inverse-Wishart prior:

p <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
          R = list(V = diag(2), nu = 0.002))

Fit the model

m <- MCMCglmm(cbind(x, y) ~ trait - 1,
              # trait-1 gives each variable a separate intercept
              random = ~ us(trait):group,
              # a separate random intercept for each variable, with the
              # covariance between them allowed for and estimated
              rcov = ~ us(trait):units,
              # a separate residual variance for each trait, with the
              # covariance between them estimated
              family = c("gaussian", "gaussian"), prior = p, data = df)

In the model summary, summary(m), the G structure describes the variances and covariances of the random intercepts. The R structure describes the observation-level variances and covariances, which function as residuals in MCMCglmm.

If you are of a Bayesian persuasion, you can get the entire posterior distribution of the co/variance terms from m$VCV. Note that these are variances after accounting for the fixed effects.
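Because each row of m$VCV is one posterior draw of all the co/variance terms, the ICCs from the question can be computed draw by draw. A hedged sketch on a mock posterior matrix (in real use, replace the mock with m$VCV; the column names below follow MCMCglmm's usual naming for us(trait):group and us(trait):units terms, but check colnames(m$VCV) on your own fit):

```r
set.seed(1)
# mock posterior draws standing in for two columns of m$VCV: the group-level
# and residual variances of trait x (a real m$VCV also has the covariance
# and trait-y columns)
VCV <- cbind("traitx:traitx.group" = rlnorm(1000, log(3),  0.1),
             "traitx:traitx.units" = rlnorm(1000, log(10), 0.1))

# posterior distribution of the same-variable ICC for x:
# group variance over total variance, computed per draw
icc_x <- VCV[, "traitx:traitx.group"] /
         (VCV[, "traitx:traitx.group"] + VCV[, "traitx:traitx.units"])

mean(icc_x)                        # posterior mean of the ICC
quantile(icc_x, c(0.025, 0.975))   # 95% credible interval
```

The different-item ICCs work the same way, using the cross-trait group covariance divided by the product of the two total standard deviations.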

Simulate data

library(MASS)
n<-3000

# draws from a bivariate normal distribution
df <- data.frame(mvrnorm(n, mu = c(10, 20),  # the intercepts of x and y
                         Sigma = matrix(c(10, -3, -3, 2), ncol = 2)))
# the residual variance-covariance of x and y


# assign each observation to a random-effect level
number_of_groups <- 100
df$group <- rep(1:number_of_groups, length.out = n)
group_var <- data.frame(mvrnorm(number_of_groups, mu = c(0, 0),
                                Sigma = matrix(c(3, 2, 2, 5), ncol = 2)))
# the variance-covariance matrix of the random effects:
# c(variance of x, covariance of x and y, covariance of x and y, variance of y)

# x and y are the sum of the observation-level draws and the group random effect
df$x <- df$X1 + group_var[df$group, 1]
df$y <- df$X2 + group_var[df$group, 2]

Estimating the original co/variances of the random effects requires a large number of levels of the random effect. Instead, your model will likely estimate the observed co/variances, which can be calculated with cov(group_var).
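Under the simulation above, the population ICCs the model is targeting can be worked out directly from the two Sigma matrices, which is a useful sanity check against the posterior estimates:

```r
G <- matrix(c(3, 2, 2, 5), ncol = 2)     # random-effect covariance (as simulated)
R <- matrix(c(10, -3, -3, 2), ncol = 2)  # residual covariance (as simulated)

# same-variable ICCs: group variance over total variance
icc_x <- G[1, 1] / (G[1, 1] + R[1, 1])   # 3 / 13
icc_y <- G[2, 2] / (G[2, 2] + R[2, 2])   # 5 / 7

# cross-variable, cross-sibling correlation: only the group-level covariance
# is shared across siblings, scaled by the two total standard deviations
icc_xy <- G[1, 2] / sqrt((G[1, 1] + R[1, 1]) * (G[2, 2] + R[2, 2]))

c(icc_x, icc_y, icc_xy)
```

With many groups, the posterior means of the corresponding ratios of m$VCV columns should land near these values.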