Solved – Create synthetic data with a given intraclass correlation coefficient (ICC)

intraclass-correlationrsynthetic data

I want to generate some synthetic data with $I$ observations across $J$ clusters. Additionally, I want the intraclass correlation coefficient ($ICC$) to be an input of my data generation process. So, at the end I want to end-up with a data frame that has 2 columns: 1. a cluster ID, 2. outcome

$Y_{ij}$ is the outcome for individual $i$ in cluster $j$

$$ Y_{ij} = \mu + \alpha_j + \epsilon_{ij} $$

The intraclass correlation is defined as

$$ ICC = \frac{\sigma_\alpha^2}{ \sigma_\alpha^2 + \sigma_\epsilon^2}$$

So, if i want $ICC = 0.2$ and $\sigma_\epsilon^2 = 1$, I can solve for $\sigma_\alpha^2$

f_var_alpha <- function(ICC, var_epsilon){
  var_alpha <- (ICC*var_epsilon)/(1-ICC)
  return(var_alpha)
}

var_alpha <-f_var_alpha(ICC = 0.2, var_epsilon = 1)

var_alpha

Now I could use $\sigma_\alpha^2 = 0.25$ to generate my data.

Suppose that I want to generate 1000 observations across 10 clusters, and that $\mu = 0.5$. This is what i did:

library(tidyverse)
set.seed(22217)
N <- 1000
J <- 10

n_per_j <- N/J

gen_data_j <- function(j, J, N){
  n_per_j <- N/J
  cluster_j <- data.frame(J=LETTERS[j], 
                          alpha_j = rnorm(n = n_per_j, mean = 0, sd = sqrt(var_alpha)),
                          epsilon_ij = rnorm(n = n_per_j, mean = 0, sd = 1)) %>% 
    mutate(y = 0.5 + alpha_j + epsilon_ij)
  return(cluster_j)
}

df <- lapply(X = 1:J, FUN = gen_data_j, N=N, J=J) %>% bind_rows() %>% 
  mutate(J = as.factor(J))

Alas, if I check the ICC i don't get 0.2:

library(ICC)
ICCbare(y = y, x = J, data = df)
0.00264932

What am I missing?

Best Answer

$ \alpha_j $ needs to be randomly drawn once for each site. The number drawn should then be added to each observation within that site. Your code is currently generating an individual $ \alpha_j $ for each subject at each site, which is wrong.

alpha_j = rnorm(n = n_per_j, mean = 0, sd = sqrt(var_alpha))

Should become

alpha_j = rnorm(n = 1, mean = 0, sd = sqrt(var_alpha))

Using your seed, this estimates an ICC of 0.2805562. Of course, we don't expect the estimated ICC to be exactly 0.2.

Here is a histogram of estimated ICC values for seeds ranging from 1 to 5000. Note that it's centered where you would like it to be.

enter image description here