I have a dataset with >800 cases ($n$) from >30 ($k$) different organisations (clustered data). The number of cases within each organisation differs (unbalanced data; e.g., organisation 1 = 30 cases, organisation 2 = 13 cases, …).
I want to randomly split the dataset into a calibration (training) and a validation (test) sample in order to cross-validate a structural equation model.
However, I am unsure how I should actually do the split. In my opinion, there are two valid options:
- Randomly splitting the dataset neglecting the clustering into different organisations (randomly choosing participants from different organisations).
- Randomly splitting the dataset based on the clustering (i.e., randomly choosing $k_a = k/2$ organisations for the calibration sample and $k_b = k - k_a$ organisations for the validation sample).
Option 1 has the advantage that I get two samples of identical size ($n_a = n_b$). Option 2, on the other hand, has the advantage of taking the clustered data structure into account, but it produces samples of different sizes ($n_a \neq n_b$).
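To make the two options concrete, here is a minimal sketch in Python (used purely for illustration, standard library only). The toy data and the function names `make_toy_data`, `split_ignoring_clusters` and `split_by_cluster` are hypothetical and only mimic the unbalanced structure described above; they are not the real dataset.

```python
import random

def make_toy_data(n_orgs=30, seed=1):
    """Hypothetical unbalanced data: a list of (case_id, org_id) pairs."""
    rng = random.Random(seed)
    cases = []
    for org in range(n_orgs):
        for _ in range(rng.randint(10, 40)):  # unequal cluster sizes
            cases.append((len(cases), org))
    return cases

def split_ignoring_clusters(cases, seed=1):
    """Option 1: shuffle individual cases and cut the list in half."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def split_by_cluster(cases, seed=1):
    """Option 2: assign whole organisations to one sample or the other."""
    rng = random.Random(seed)
    orgs = sorted({org for _, org in cases})
    rng.shuffle(orgs)
    calib_orgs = set(orgs[:len(orgs) // 2])  # k_a = k / 2 organisations
    calib = [c for c in cases if c[1] in calib_orgs]
    valid = [c for c in cases if c[1] not in calib_orgs]
    return calib, valid

cases = make_toy_data()
a1, b1 = split_ignoring_clusters(cases)
a2, b2 = split_by_cluster(cases)
print("option 1 sizes:", len(a1), len(b1))  # equal up to one case
print("option 2 sizes:", len(a2), len(b2))  # generally unequal
```

With option 1, most organisations end up contributing cases to both samples; with option 2, no organisation appears in both, so the validation sample consists entirely of organisations the model has not seen.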
Is there a preferred way to split datasets in cases of clustered data structures?
PS: I calculated intraclass correlation coefficients (ICC1, in R with `multilevel::mult.icc`) for all dependent variables. The ICC is below .1 for all variables. It can therefore be assumed that only a small share of the variance is attributable to organisational membership.
PPS: I added machine-learning as a tag since cross-validation is often done in this field.
Edit:
I reconsidered the whole problem and came up with another option:
- Randomly choosing ~50% of the individuals within each of the $k$ organisations. This approach would preserve the original cluster structure in both subsamples and give $n_a = n_b$.
However, I am still quite unsure how to tackle the subsetting, since I do not have a rationale to guide me. I have not yet found any literature that considers such issues.
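The within-organisation split from the edit could be sketched as follows, again in Python for illustration only. The function name `split_within_clusters` and the three toy organisations (sizes 30, 13 and 21) are assumptions made for the example.

```python
import random
from collections import defaultdict

def split_within_clusters(cases, seed=1):
    """Split each organisation's cases roughly 50/50 between the two samples."""
    rng = random.Random(seed)
    by_org = defaultdict(list)
    for case_id, org in cases:
        by_org[org].append(case_id)

    calib, valid = [], []
    for org in sorted(by_org):
        members = by_org[org]
        rng.shuffle(members)
        half = len(members) // 2
        # Odd-sized organisation: decide at random which sample gets the extra case.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            half += 1
        calib += [(cid, org) for cid in members[:half]]
        valid += [(cid, org) for cid in members[half:]]
    return calib, valid

# Hypothetical toy data: (case_id, org_id) pairs for three organisations.
cases = list(enumerate([0] * 30 + [1] * 13 + [2] * 21))
calib, valid = split_within_clusters(cases)
print(len(calib), len(valid))
```

Note that organisations with an odd number of cases cannot be halved exactly, so $n_a = n_b$ holds only approximately: the two sample sizes can differ by up to the number of odd-sized organisations.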
Best Answer
Your sample size of 800 is too low by an order of magnitude for data splitting to be a reliable validation method. You will get very different results each time you split. I suggest using the optimism bootstrap instead, repeating all modeling steps in each of, say, 400 resamples.
In the R `rms` package, the `validate` and `calibrate` series of functions have options for clustered/grouped bootstrapping.
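To show the general idea of an optimism bootstrap that resamples whole clusters, here is a minimal sketch in Python (standard library only). It is not the `rms` implementation: a one-predictor least-squares line stands in for the structural equation model, $R^2$ stands in for the fit index, and the names `fit_line`, `r_squared` and `cluster_optimism_bootstrap` as well as the simulated data are hypothetical.

```python
import random
from collections import defaultdict

def fit_line(rows):
    """Least-squares fit of y = a + b*x; rows are (org, x, y) tuples."""
    n = len(rows)
    mx = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    sxx = sum((r[1] - mx) ** 2 for r in rows)
    sxy = sum((r[1] - mx) * (r[2] - my) for r in rows)
    b = sxy / sxx
    return my - b * mx, b

def r_squared(model, rows):
    """R^2 of a fitted (a, b) model evaluated on the given rows."""
    a, b = model
    my = sum(r[2] for r in rows) / len(rows)
    ss_res = sum((r[2] - (a + b * r[1])) ** 2 for r in rows)
    ss_tot = sum((r[2] - my) ** 2 for r in rows)
    return 1 - ss_res / ss_tot

def cluster_optimism_bootstrap(rows, n_boot=400, seed=1):
    rng = random.Random(seed)
    by_org = defaultdict(list)
    for r in rows:
        by_org[r[0]].append(r)
    orgs = sorted(by_org)

    # Apparent performance: fit and evaluate on the full original data.
    apparent = r_squared(fit_line(rows), rows)

    optimism = 0.0
    for _ in range(n_boot):
        # Resample whole organisations with replacement (cluster bootstrap).
        drawn = rng.choices(orgs, k=len(orgs))
        boot = [r for org in drawn for r in by_org[org]]
        model = fit_line(boot)  # repeat the modeling steps on the resample
        # Optimism: performance on the resample minus performance on the original data.
        optimism += r_squared(model, boot) - r_squared(model, rows)
    optimism /= n_boot
    return apparent, optimism, apparent - optimism

# Simulated unbalanced clustered data with a small organisation effect.
rng = random.Random(0)
rows = []
for org in range(30):
    org_effect = rng.gauss(0, 0.3)
    for _ in range(rng.randint(10, 40)):
        x = rng.gauss(0, 1)
        rows.append((org, x, 0.5 * x + org_effect + rng.gauss(0, 1)))

apparent, optimism, corrected = cluster_optimism_bootstrap(rows)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={corrected:.3f}")
```

Every case is used both for fitting and for validation, so no data are set aside, and resampling organisations rather than individual cases keeps the cluster structure intact in each resample.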