Solved – Grouped 7-fold Cross Validation in R

Tags: accuracy, caret, cross-validation, r, random forest

I am searching for a grouped 7-fold cross validation function. I couldn't find it in the caret package.

I have 70 subjects, each performing 7 trials (outcome variable: categorical with 7 levels), which gives 490 observations. I trained a random forest that reaches reasonable accuracy both out-of-bag (89%) and in 10-fold CV. Since the data are hierarchical (the 7 observations belonging to one subject are dependent), a colleague suggested making sure that trials from the same subject never end up in both the training split and the test split.

What do you think: should I do 7-fold CV grouped by subject, meaning that each fold would always contain all trials of 10 participants?
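
To make the idea concrete, here is a minimal base R sketch of what "grouped by subject" means for this design. The data frame `df` and its `ID` column are stand-ins invented for illustration; only the 70 x 7 layout is taken from the description above.

```r
set.seed(42)

# Hypothetical data layout: 70 subjects x 7 trials = 490 rows
df <- data.frame(ID = rep(seq_len(70), each = 7))

k <- 7
subjects <- unique(df$ID)

# Shuffle the subjects, then deal them into k equally sized folds (10 each)
fold_of_subject <- setNames(
  rep(seq_len(k), length.out = length(subjects)),
  sample(subjects)
)

# Fold label for every row, looked up via the row's subject ID
row_fold <- unname(fold_of_subject[as.character(df$ID)])

# Training row indices per fold: all rows whose subject is NOT held out
train_index <- lapply(seq_len(k), function(f) which(row_fold != f))
names(train_index) <- paste0("Fold", seq_len(k))

# Held-out rows per fold, for inspection
test_index <- lapply(train_index, function(tr) setdiff(seq_len(nrow(df)), tr))

sapply(test_index, length)                                  # 70 rows held out per fold
sapply(test_index, function(te) length(unique(df$ID[te])))  # 10 subjects held out per fold
```

Every subject lands in exactly one held-out fold, so no subject contributes rows to both sides of a split.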

Thanks in advance

Edit:
Thanks for your comment. I had simply missed the caret documentation on groupKFold. Here is a code solution that worked for me:

########################## Caret Preparation ############################
k.folds <- 7
df1.folds <- groupKFold(df1$ID, k = k.folds)
df2.folds <- groupKFold(df2$ID, k = k.folds)

# Intended: 7 folds grouped by subject, repeated 3 times.
# Note: number and repeats are ignored once index is supplied (see Edit 2).
df1.control <- trainControl(
  method = "repeatedcv",
  number = k.folds,
  repeats = 3,
  index = df1.folds)

df2.control <- trainControl(
  method = "repeatedcv",
  number = k.folds,
  repeats = 3,
  index = df2.folds)

Edit 2 (26.11.21):
Please see the answer provided by @otwtm: supplying the index argument (in my case created by groupKFold, which basically returns a list of the row indices used for training in each fold) overrides the arguments number and repeats.

########################## Caret Preparation ############################
library(caret)

k.folds <- 7
# Training row indices per fold; all rows of one subject stay in the same fold
df1.folds <- groupKFold(df1$ID, k = k.folds)
df2.folds <- groupKFold(df2$ID, k = k.folds)

# 7 folds grouped by subject; number and repeats are omitted because index overrides them
df1.control <- trainControl(
  method = "repeatedcv",
  index = df1.folds)

df2.control <- trainControl(
  method = "repeatedcv",
  index = df2.folds)
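
If repeats are still wanted, one option is to build the repeated index list by hand, since number and repeats no longer have any effect once index is given. The following is only a sketch under assumptions: `grouped_folds` is a hypothetical base R helper written for this example (it is not part of caret), and `ids` stands in for df1$ID.

```r
# Hypothetical helper: returns a list of training row indices,
# one element per fold, with whole subjects held out together
grouped_folds <- function(ids, k) {
  subjects <- unique(ids)
  # Random, balanced assignment of subjects to folds
  fold_of_subject <- sample(rep(seq_len(k), length.out = length(subjects)))
  row_fold <- fold_of_subject[match(ids, subjects)]
  folds <- lapply(seq_len(k), function(f) which(row_fold != f))
  names(folds) <- paste0("Fold", seq_len(k))
  folds
}

set.seed(123)
ids <- rep(seq_len(70), each = 7)   # stand-in for df1$ID
k.folds <- 7
n.repeats <- 3

# One fresh subject-to-fold assignment per repeat, flattened into a single list
repeated_index <- unlist(
  lapply(seq_len(n.repeats), function(r) {
    folds <- grouped_folds(ids, k.folds)
    names(folds) <- paste0(names(folds), ".Rep", r)
    folds
  }),
  recursive = FALSE
)

length(repeated_index)   # k.folds * n.repeats resamples

# The list could then be passed the same way as the single-repeat folds, e.g.
# df1.control <- trainControl(method = "repeatedcv", index = repeated_index)
```

Each element is named like Fold1.Rep1, so the resamples stay distinguishable in the results.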

Best Answer

Yes, do make sure you are testing on unknown patients, i.e. that every test fold contains only subjects the model has not seen during training.

(I also work with highly multivariate data with multiple measurements per subject, and I have met situations where not splitting training patients from test patients underestimated the prediction error by an order of magnitude!)
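
A quick sanity check along these lines can be sketched in base R. The function name `no_subject_leakage` and the toy splits are made up for illustration; the check assumes the held-out rows of a resample are simply all rows not listed in its training index.

```r
# Hypothetical helper: TRUE if no subject contributes rows to both the
# training part and the held-out part of any resample
no_subject_leakage <- function(index, ids) {
  all(vapply(index, function(train_rows) {
    held_out_rows <- setdiff(seq_along(ids), train_rows)
    length(intersect(ids[train_rows], ids[held_out_rows])) == 0
  }, logical(1)))
}

ids <- rep(seq_len(70), each = 7)   # 70 subjects x 7 trials

# Grouped split: subjects 1-10 are held out completely
grouped <- list(Fold1 = which(ids > 10))

# Row-wise split: every 7th row is held out, so each subject is on both sides
rowwise <- list(Fold1 = which(seq_along(ids) %% 7 != 0))

no_subject_leakage(grouped, ids)   # TRUE
no_subject_leakage(rowwise, ids)   # FALSE
```

Running such a check on the index list before training is a cheap way to confirm that the reported error really refers to unseen subjects.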