Solved – Split clustered data into calibration and validation sample (Cross validation)

cross-validation, machine-learning

I have a dataset with >800 cases ($n$) from >30 ($k$) different organisations (clustered data). The number of cases per organisation differs (unbalanced data; e.g., organisation 1 = 30 cases, organisation 2 = 13 cases, …).

I want to randomly split the dataset into a calibration (training) and a validation (test) sample in order to cross-validate a structural equation model.

However, I am unsure how I should actually do the split. In my opinion, there are two valid options:

  1. Randomly splitting the dataset while ignoring the clustering into organisations (i.e., randomly choosing participants regardless of organisation).
  2. Randomly splitting the dataset based on the clustering (i.e., randomly choosing $k_a = k/2$ organisations for the calibration sample and $k_b = k - k_a$ organisations for the validation sample).

Option 1 has the advantage that I get two samples with identical sample sizes ($n_a = n_b$). Option 2, on the other hand, has the advantage of taking the clustered data structure into account, but it produces samples with different sample sizes ($n_a \neq n_b$).
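For concreteness, a minimal sketch of what the two splits could look like, assuming a data frame `dat` with a grouping column `org` (hypothetical names):

```r
set.seed(123)

# Option 1: ignore clustering and split individuals at random (n_a = n_b).
idx_a   <- sample(seq_len(nrow(dat)), size = floor(nrow(dat) / 2))
calib_1 <- dat[idx_a, ]
valid_1 <- dat[-idx_a, ]

# Option 2: split at the organisation level (k_a = k/2, n_a usually != n_b).
orgs    <- unique(dat$org)
orgs_a  <- sample(orgs, size = floor(length(orgs) / 2))
calib_2 <- dat[dat$org %in% orgs_a, ]
valid_2 <- dat[!dat$org %in% orgs_a, ]
```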

Is there a preferred way to split datasets in cases of clustered data structures?

PS: I calculated intraclass correlation coefficients (ICC1, in R with multilevel::mult.icc) for all dependent variables. The ICC is below .1 for all variables, so it can be assumed that only a small amount of variance is explained by organisational membership.
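For reference, a sketch of that ICC computation, assuming the dependent variables are columns y1 and y2 of a data frame `dat` with a grouping column `org` (hypothetical names):

```r
library(multilevel)

# ICC1/ICC2 for each dependent variable, grouped by organisation.
mult.icc(dat[, c("y1", "y2")], grpid = dat$org)
```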

PPS: I added machine-learning as a tag since cross-validation is often done in that field.

Edit:

I reconsidered the whole problem and came up with another option:

  3. Randomly choosing ~50% of individuals within each of the $k$ organisations. This approach would keep the original cluster structure in both subsamples while also giving $n_a = n_b$ (see the sketch below).
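A sketch of this within-organisation split, again assuming a data frame `dat` with a grouping column `org` (hypothetical names):

```r
set.seed(123)

# Sample ~50% of individuals within each organisation for the calibration set.
calib_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$org), function(idx) {
  sample(idx, size = floor(length(idx) / 2))
}))
calibration <- dat[calib_idx, ]
validation  <- dat[-calib_idx, ]
```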

However, I am still quite unsure how to tackle the subsetting since I do not have a rationale to guide me, and I have not yet found literature that considers such issues.

Best Answer

Your sample size of 800 is too low by an order of magnitude for data splitting to be a reliable validation method. You will get very different results each time you split. I suggest using the optimism bootstrap instead, repeating all modelling steps in each of, say, 400 bootstrap resamples.

In the R rms package, the validate and calibrate series of functions have options for clustered/grouped bootstrapping.
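A minimal sketch of how this could look for a binary outcome fitted with lrm, assuming a data frame `dat` with outcome `y`, predictors `x1` and `x2`, and a grouping column `org` (hypothetical names); my understanding is that the `cluster` argument is passed through to `predab.resample()`, so whole organisations are resampled together:

```r
library(rms)

# Fit the model, keeping the design matrix and response for resampling.
fit <- lrm(y ~ x1 + x2, data = dat, x = TRUE, y = TRUE)

# Optimism bootstrap with cluster (grouped) resampling; B = number of resamples.
validate(fit, B = 400, cluster = dat$org)
cal <- calibrate(fit, B = 400, cluster = dat$org)
plot(cal)
```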