I want to generate some synthetic data with $I$ observations across $J$ clusters. Additionally, I want the intraclass correlation coefficient ($ICC$) to be an input of my data generation process. So, at the end I want to end-up with a data frame that has 2 columns: 1. a cluster ID, 2. outcome
$Y_{ij}$ is the outcome for individual $i$ in cluster $j$
$$ Y_{ij} = \mu + \alpha_j + \epsilon_{ij} $$
The intraclass correlation is defined as
$$ ICC = \frac{\sigma_\alpha^2}{ \sigma_\alpha^2 + \sigma_\epsilon^2}$$
So, if i want $ICC = 0.2$ and $\sigma_\epsilon^2 = 1$, I can solve for $\sigma_\alpha^2$
f_var_alpha <- function(ICC, var_epsilon){
var_alpha <- (ICC*var_epsilon)/(1-ICC)
return(var_alpha)
}
var_alpha <-f_var_alpha(ICC = 0.2, var_epsilon = 1)
var_alpha
Now I could use $\sigma_\alpha^2 = 0.25$ to generate my data.
Suppose that I want to generate 1000 observations across 10 clusters, and that $\mu = 0.5$. This is what i did:
library(tidyverse)
set.seed(22217)
N <- 1000
J <- 10
n_per_j <- N/J
gen_data_j <- function(j, J, N){
n_per_j <- N/J
cluster_j <- data.frame(J=LETTERS[j],
alpha_j = rnorm(n = n_per_j, mean = 0, sd = sqrt(var_alpha)),
epsilon_ij = rnorm(n = n_per_j, mean = 0, sd = 1)) %>%
mutate(y = 0.5 + alpha_j + epsilon_ij)
return(cluster_j)
}
df <- lapply(X = 1:J, FUN = gen_data_j, N=N, J=J) %>% bind_rows() %>%
mutate(J = as.factor(J))
Alas, if I check the ICC i don't get 0.2:
library(ICC)
ICCbare(y = y, x = J, data = df)
0.00264932
What am I missing?
Best Answer
$ \alpha_j $ needs to be randomly drawn once for each site. The number drawn should then be added to each observation within that site. Your code is currently generating an individual $ \alpha_j $ for each subject at each site, which is wrong.
Should become
Using your seed, this estimates an ICC of 0.2805562. Of course, we don't expect the estimated ICC to be exactly 0.2.
Here is a histogram of estimated ICC values for seeds ranging from 1 to 5000. Note that it's centered where you would like it to be.