Solved – Create synthetic data with a given intraclass correlation coefficient (ICC)

intraclass-correlation, r, synthetic data

I want to generate synthetic data with $I$ observations across $J$ clusters, with the intraclass correlation coefficient ($ICC$) as an input to the data-generating process. In the end I want a data frame with two columns: (1) a cluster ID and (2) the outcome.

Let $Y_{ij}$ be the outcome for individual $i$ in cluster $j$:

$$ Y_{ij} = \mu + \alpha_j + \epsilon_{ij} $$

The intraclass correlation is defined as

$$ ICC = \frac{\sigma_\alpha^2}{ \sigma_\alpha^2 + \sigma_\epsilon^2}$$

So, if I want $ICC = 0.2$ and $\sigma_\epsilon^2 = 1$, I can solve for $\sigma_\alpha^2$:

f_var_alpha <- function(ICC, var_epsilon){
  var_alpha <- (ICC*var_epsilon)/(1-ICC)
  return(var_alpha)
}

var_alpha <-f_var_alpha(ICC = 0.2, var_epsilon = 1)

var_alpha

Now I could use $\sigma_\alpha^2 = 0.25$ to generate my data.
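For reference, the rearrangement behind f_var_alpha can be written out step by step. This is only a worked version of the ICC definition above, using the same symbols and the values $ICC = 0.2$ and $\sigma_\epsilon^2 = 1$:

$$ ICC \, (\sigma_\alpha^2 + \sigma_\epsilon^2) = \sigma_\alpha^2 \;\Longrightarrow\; \sigma_\alpha^2 \, (1 - ICC) = ICC \, \sigma_\epsilon^2 \;\Longrightarrow\; \sigma_\alpha^2 = \frac{ICC \, \sigma_\epsilon^2}{1 - ICC} = \frac{0.2 \times 1}{1 - 0.2} = 0.25 $$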

Suppose that I want to generate 1000 observations across 10 clusters, and that $\mu = 0.5$. This is what I did:

library(tidyverse)
set.seed(22217)
N <- 1000
J <- 10

n_per_j <- N/J

gen_data_j <- function(j, J, N){
  n_per_j <- N/J
  cluster_j <- data.frame(J=LETTERS[j], 
                          alpha_j = rnorm(n = n_per_j, mean = 0, sd = sqrt(var_alpha)),
                          epsilon_ij = rnorm(n = n_per_j, mean = 0, sd = 1)) %>% 
    mutate(y = 0.5 + alpha_j + epsilon_ij)
  return(cluster_j)
}

df <- lapply(X = 1:J, FUN = gen_data_j, N=N, J=J) %>% bind_rows() %>% 
  mutate(J = as.factor(J))

Alas, when I check the ICC, I don't get 0.2:

library(ICC)
ICCbare(y = y, x = J, data = df)
0.00264932

What am I missing?

Best Answer

$ \alpha_j $ needs to be randomly drawn once for each site. The number drawn should then be added to each observation within that site. Your code is currently generating an individual $ \alpha_j $ for each subject at each site, which is wrong.

alpha_j = rnorm(n = n_per_j, mean = 0, sd = sqrt(var_alpha))

Should become

alpha_j = rnorm(n = 1, mean = 0, sd = sqrt(var_alpha))

Using your seed, this estimates an ICC of 0.2805562. Of course, we don't expect the estimated ICC to be exactly 0.2.
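As a self-contained illustration of the fix, here is a minimal base-R sketch that draws one $\alpha_j$ per cluster and repeats it for every observation in that cluster. The names gen_clustered and icc_anova are made up for this example, and the ICC is estimated with the usual one-way ANOVA estimator for balanced data rather than with the ICC package. The random numbers are drawn in a different order than in the question's code, so this will not reproduce the 0.2805562 quoted above.

```r
# Generate balanced clustered data with a target ICC (illustrative helper).
gen_clustered <- function(N, J, icc, var_epsilon = 1, mu = 0.5) {
  stopifnot(N %% J == 0, icc >= 0, icc < 1)
  n_per_j   <- N / J
  var_alpha <- icc * var_epsilon / (1 - icc)

  # One alpha per cluster, not one per observation
  alpha   <- rnorm(J, mean = 0, sd = sqrt(var_alpha))
  cluster <- rep(seq_len(J), each = n_per_j)
  epsilon <- rnorm(N, mean = 0, sd = sqrt(var_epsilon))

  data.frame(cluster = factor(cluster),
             alpha_j = alpha[cluster],          # same value within a cluster
             y       = mu + alpha[cluster] + epsilon)
}

# One-way ANOVA estimator of the ICC (assumes equal cluster sizes).
icc_anova <- function(y, cluster) {
  tab <- anova(lm(y ~ cluster))
  msb <- tab[["Mean Sq"]][1]                    # between-cluster mean square
  msw <- tab[["Mean Sq"]][2]                    # within-cluster mean square
  n   <- length(y) / nlevels(cluster)
  (msb - msw) / (msb + (n - 1) * msw)
}

set.seed(22217)
sim     <- gen_clustered(N = 1000, J = 10, icc = 0.2)
icc_hat <- icc_anova(sim$y, sim$cluster)
icc_hat
```

With only 10 clusters, a single estimate can land well away from 0.2, which is expected.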

Here is a histogram of estimated ICC values for seeds ranging from 1 to 5000. Note that it's centered where you would like it to be.
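A rough way to build such a histogram yourself is sketched below. It is an assumption about the procedure, not the exact code used for the figure: sim_icc is a hypothetical helper, only 500 seeds are used to keep the run short, $\mu$ is omitted because a constant shift does not change the ICC, and the ICC is computed with the one-way ANOVA estimator for balanced data.

```r
# Simulate one balanced data set for a given seed and return its ANOVA ICC estimate.
sim_icc <- function(seed, N = 1000, J = 10, icc = 0.2, var_epsilon = 1) {
  set.seed(seed)
  n         <- N / J
  var_alpha <- icc * var_epsilon / (1 - icc)
  cluster   <- rep(seq_len(J), each = n)

  # One alpha per cluster, shared by all observations in that cluster
  y <- rnorm(J, 0, sqrt(var_alpha))[cluster] + rnorm(N, 0, sqrt(var_epsilon))

  cluster_means <- tapply(y, cluster, mean)
  msb <- n * sum((cluster_means - mean(y))^2) / (J - 1)   # between mean square
  msw <- sum((y - cluster_means[cluster])^2) / (N - J)    # within mean square
  (msb - msw) / (msb + (n - 1) * msw)
}

icc_hats <- vapply(1:500, sim_icc, numeric(1))
summary(icc_hats)

hist(icc_hats, breaks = 30,
     main = "Estimated ICC over 500 seeds", xlab = "Estimated ICC")
abline(v = 0.2, lty = 2)   # target ICC
```

The spread is wide because only 10 cluster effects inform the between-cluster variance; increasing J narrows it far more than increasing the number of observations per cluster.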

Related Solutions

Solved – Intraclass Correlation Coefficients (ICC) with Multiple Variables

Fit the model

library(MCMCglmm)

# Note: the prior `p` is not defined in this answer; `df` is the data simulated below.
m <- MCMCglmm(cbind(x, y) ~ trait - 1,
              # trait - 1 gives each variable a separate intercept
              random = ~ us(trait):group,
              # a separate random intercept for each variable; the covariance between them is estimated
              rcov = ~ us(trait):units,
              # a separate residual variance for each trait; the covariance between them is estimated
              family = c("gaussian", "gaussian"), prior = p, data = df)

In the model summary, summary(m), the G-structure describes the variances and covariance of the random intercepts. The R-structure describes the observation-level variances and covariance, which function as residuals in MCMCglmm.

If you are of a Bayesian persuasion, you can get the entire posterior distribution of the co/variance terms from m$VCV. Note that these are variances after accounting for the fixed effects.

Simulate data

library(MASS)
n <- 3000

# Draw from a bivariate normal distribution:
# mu holds the intercepts of x and y, Sigma the residual variance-covariance of x and y
df <- data.frame(mvrnorm(n, mu = c(10, 20),
                         Sigma = matrix(c(10, -3, -3, 2), ncol = 2)))

# Assign each observation to a group and draw one pair of random effects per group.
# Sigma is the variance-covariance matrix of the random effects:
# c(variance of x, covariance of x and y, covariance of x and y, variance of y)
number_of_groups <- 100
df$group <- rep(1:number_of_groups, length.out = n)
group_var <- data.frame(mvrnorm(number_of_groups, mu = c(0, 0),
                                Sigma = matrix(c(3, 2, 2, 5), ncol = 2)))

# x and y are the sum of the bivariate draws and the group-level random effects
df$x <- df$X1 + group_var[df$group, 1]
df$y <- df$X2 + group_var[df$group, 2]

Recovering the original co/variances of the random effects requires a large number of levels of the random effect. With fewer levels, your model will likely estimate the observed co/variances instead, which can be calculated with cov(group_var).
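To see how far the observed co/variances can drift from the generating matrix, the sketch below compares the sample covariance of the drawn random effects for 100 groups and for 10,000 groups. The helper observed_cov, the seed, and the larger group count are choices made for this illustration only.

```r
library(MASS)  # for mvrnorm

set.seed(1)  # arbitrary seed, for reproducibility only

# Generating variance-covariance matrix of the random effects (same as above)
Sigma_group <- matrix(c(3, 2, 2, 5), ncol = 2)

# Sample covariance of k freshly drawn pairs of random effects
observed_cov <- function(k) {
  cov(mvrnorm(k, mu = c(0, 0), Sigma = Sigma_group))
}

cov_100   <- observed_cov(100)
cov_10000 <- observed_cov(10000)

# Deviation of the observed co/variances from the generating values
cov_100 - Sigma_group
cov_10000 - Sigma_group
```

The deviations for 100 groups are typically much larger than for 10,000 groups, so a model fitted to the smaller data set is effectively aiming at cov(group_var), not at the generating matrix.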

R – Intraclass Correlation (ICC) for Interaction

The R model formula

lmer(measurement ~ 1 + (1 | subject) + (1 | site), mydata)

fits the model

$$ Y_{ijk} = \beta_0 + \eta_{i} + \theta_{j} + \varepsilon_{ijk} $$

where $Y_{ijk}$ is the $k$'th measurement from subject $i$ at site $j$, $\eta_{i}$ is the subject $i$ random effect, $\theta_{j}$ is the site $j$ random effect and $\varepsilon_{ijk}$ is the leftover error. These random effects have variances $\sigma^{2}_{\eta}, \sigma^{2}_{\theta}, \sigma^{2}_{\varepsilon}$ that are estimated by the model. (Note that if subject is nested within site, you would traditionally write $\theta_{ij}$ here instead of $\theta_{j}$).

To answer your first question regarding how to calculate the ICCs: under this model, the ICCs are the proportion of the total variation explained by the respective blocking factor. In particular, the correlation between two randomly selected observations on the same subject is:

$$ {\rm ICC}({\rm Subject}) = \frac{\sigma^{2}_{\eta}}{\sigma^{2}_{\eta}+ \sigma^{2}_{\theta}+\sigma^{2}_{\varepsilon}}$$

The correlation between two randomly selected observations from the same site is:

$$ {\rm ICC}({\rm Site}) = \frac{\sigma^{2}_{\theta}}{\sigma^{2}_{\eta}+ \sigma^{2}_{\theta}+\sigma^{2}_{\varepsilon}}$$

The correlation between two randomly selected observations on the same individual, and at the same site (the so-called interaction ICC) is:

$$ {\rm ICC}({\rm Subject/Site \ Interaction}) = \frac{\sigma^{2}_{\eta}+\sigma^{2}_{\theta}}{\sigma^{2}_{\eta}+ \sigma^{2}_{\theta}+\sigma^{2}_{\varepsilon}}$$

It seems you were confused by this being referred to as an "interaction", since it's the sum of individual terms. It's an "interaction" in the sense that it estimates the ${\rm ICC}$ corresponding to the blocking factor composed of the combination of subject and site. It's important to note that you do not have to include some kind of "interaction" term between the factors to estimate this quantity.

Each of these quantities can be estimated by plugging in the estimates of these variances that come out of the model fitting.
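As a minimal sketch of that plug-in step, the function below takes the three estimated variances and returns the three ICCs defined above. The function name icc_crossed and the numeric values are hypothetical; with lme4, the variance estimates could be read from as.data.frame(VarCorr(fit)), which has `grp` and `vcov` columns.

```r
# Plug-in ICCs for the model with subject and site random intercepts.
icc_crossed <- function(var_subject, var_site, var_resid) {
  total <- var_subject + var_site + var_resid
  c(subject     = var_subject / total,
    site        = var_site / total,
    interaction = (var_subject + var_site) / total)
}

# Hypothetical variance estimates, for illustration only
iccs <- icc_crossed(var_subject = 2, var_site = 1, var_resid = 1)
iccs
```

With these made-up values the subject, site, and interaction ICCs are 0.5, 0.25, and 0.75; the interaction ICC is always the sum of the other two.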

Regarding your second question - as you can see here, each ${\rm ICC}$ has a fairly clear interpretation. I would argue that the interaction ${\rm ICC}$ does tell us something interesting - how "similar" are measurements that share both subject and site?

One important point to note is that if subjects are nested within sites, then the Subject ${\rm ICC}$ is not meaningful in its own right, since it's impossible to share subject and not site. Then $\sigma^{2}_{\eta}$ becomes only a measure of how much more similar individuals are to themselves, compared to other individuals at their site.
