Solved – Grouped 7-fold Cross Validation in R

accuracy, caret, cross-validation, r, random forest

I am searching for a grouped 7-fold cross validation function. I couldn't find it in the caret package.

I have 70 subjects who each performed 7 trials (outcome variable: categorical with 7 values), giving 490 observations. I trained a random forest with reasonable accuracy on the OOB estimate (89%) as well as in 10-fold CV. Since the data are hierarchical/dependent (7 observations belong to each subject), a colleague suggested preventing trials from the same subject from appearing in both the training split and the test split.

What do you think, should I do 7-fold CV grouped by subject, meaning that each fold would always include all trials of 10 participants?

Thanks in advance

Edit:
Thanks for your comment. I had simply missed the caret documentation on groupKFold. Here is a code solution that worked for me:

```
########################## Caret Preparation ############################
library(caret)

k.folds <- 7
# Training-row indices per fold; all rows of a given ID stay together
df1.folds <- groupKFold(df1$ID, k = k.folds)
df2.folds <- groupKFold(df2$ID, k = k.folds)

# 7 folds grouped by subject cross validation, repeated 3 times
df1.control <- trainControl(
  method = "repeatedcv",
  number = k.folds,
  repeats = 3,
  index = df1.folds)

# 7 folds grouped by subject cross validation, repeated 3 times
df2.control <- trainControl(
  method = "repeatedcv",
  number = k.folds,
  repeats = 3,
  index = df2.folds)
```

Edit 2 (26.11.21):
Please see the answer provided by @otwtm: supplying the index argument (created in my case by groupKFold, and basically just a list of the indices used for training) overrides the arguments number and repeats.

```
########################## Caret Preparation ############################
library(caret)

k.folds <- 7
# Training-row indices per fold; all rows of a given ID stay together
df1.folds <- groupKFold(df1$ID, k = k.folds)
df2.folds <- groupKFold(df2$ID, k = k.folds)

# 7 folds grouped by subject
df1.control <- trainControl(
  method = "repeatedcv",
  index = df1.folds)

# 7 folds grouped by subject
df2.control <- trainControl(
  method = "repeatedcv",
  index = df2.folds)
```
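
Because a supplied index takes precedence over number and repeats, repeating the grouped CV means building the repeated index list yourself. The following is a minimal base-R sketch of that idea, not caret's own implementation: `grouped_repeated_folds` is a hypothetical helper, and the `ids` vector only mimics the 70 x 7 layout described in the question.

```r
# Hypothetical helper (not part of caret): training-row indices for grouped
# k-fold CV, repeated `repeats` times with a fresh shuffle of the subjects.
grouped_repeated_folds <- function(groups, k, repeats) {
  subjects <- unique(groups)
  index <- list()
  for (r in seq_len(repeats)) {
    # Assign whole subjects (not single rows) to folds at random.
    fold_of_subject <- sample(rep(seq_len(k), length.out = length(subjects)))
    names(fold_of_subject) <- as.character(subjects)
    row_fold <- unname(fold_of_subject[as.character(groups)])
    for (f in seq_len(k)) {
      # Each list element holds the rows used for TRAINING in that resample.
      index[[sprintf("Fold%d.Rep%d", f, r)]] <- which(row_fold != f)
    }
  }
  index
}

set.seed(1)
ids <- rep(seq_len(70), each = 7)  # assumed layout: 70 subjects x 7 trials
folds <- grouped_repeated_folds(ids, k = 7, repeats = 3)

# Sanity check: no subject appears in both the training and held-out rows.
leaks <- vapply(folds, function(train_rows) {
  held_out <- setdiff(seq_along(ids), train_rows)
  length(intersect(ids[train_rows], ids[held_out]))
}, integer(1))
stopifnot(all(leaks == 0))
```

With caret loaded, a list of this shape can be passed as `trainControl(method = "cv", index = folds)`; the same leak check can also be run on the output of groupKFold.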

Best Answer

Yes, do make sure you are testing unknown patients.

(I work with highly multivariate data also with multiple measurements per subject and have met situations where not splitting train patients vs. test patients would underestimate the prediction error by an order of magnitude!)
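
To see how large the optimism can get, here is a small synthetic illustration (not the poster's data or model). It assumes features that only encode who the subject is and labels assigned arbitrarily per subject, so nothing can generalise to unseen subjects. A 1-nearest-neighbour classifier is used as a stand-in because it needs only base R.

```r
set.seed(42)
n_subj <- 70
n_trials <- 7
subject <- rep(seq_len(n_subj), each = n_trials)

# Synthetic data: a subject-specific centre plus small trial noise.
centers <- matrix(rnorm(n_subj * 5), n_subj, 5)
X <- centers[subject, ] + matrix(rnorm(length(subject) * 5, sd = 0.05), ncol = 5)
# One arbitrary binary label per subject, copied to all of its trials.
y <- sample(rep(0:1, length.out = n_subj))[subject]

D <- as.matrix(dist(X))  # pairwise Euclidean distances between rows

# Cross-validated accuracy of 1-nearest-neighbour for a given fold assignment.
cv_accuracy <- function(fold) {
  pred <- integer(length(fold))
  for (f in unique(fold)) {
    test <- which(fold == f)
    train <- which(fold != f)
    nearest <- train[apply(D[test, train, drop = FALSE], 1, which.min)]
    pred[test] <- y[nearest]
  }
  mean(pred == y)
}

row_fold  <- sample(rep(1:7, length.out = length(subject)))  # ignores subjects
subj_fold <- sample(rep(1:7, length.out = n_subj))[subject]  # keeps subjects together

acc_rowwise <- cv_accuracy(row_fold)   # trials of a subject leak into training
acc_grouped <- cv_accuracy(subj_fold)  # test subjects are truly unseen
```

In this setup the row-wise estimate is driven by recognising the subject, while the grouped estimate reflects performance on unknown subjects, which is what the question is after.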

Related Solutions

K-Fold Cross Validation – Which K-Fold Cross Validation Strategy is Better?

I don't quite understand your methods, but here's what I know as cross validation sub-schemes; maybe that helps you clarify the question:

assume you have 9 samples that are ordered 1 to 9, and you're doing 3-fold CV.

block wise: the data is divided into 3 consecutive blocks:
```
case    1    2    3    4    5    6    7    8    9
fold    1    1    1    2    2    2    3    3    3
```
~~I see hardly any application where this would be useful.~~ This can be useful to extract hints about extrapolation behaviour: the first and the last block then tell you how the model does at extrapolating just outside the domain covered by the training data (calibration range in chemometrics).

interleaved or stripes or venetian blinds: 1st case is assigned to fold 1, 2nd to fold 2, and so on:
```
case    1    2    3    4    5    6    7    8    9
fold    1    2    3    1    2    3    1    2    3
```
This is sometimes used for (chemical) calibration. Samples are sorted with e.g. increasing concentration of the analyte. This assignment scheme guarantees that both training and test cases for the surrogate models always span the concentration range as far (and evenly spaced) as possible.

random: you assign the cases to folds in a random fashion:
```
case    1    2    3    4    5    6    7    8    9
fold    3    3    1    1    2    1    2    3    2
```
You can do that by shuffling your cases, and then using one of the above schemes.
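
The three assignment schemes can be sketched in a few lines of base R for the 9-sample, 3-fold example (variable names are illustrative, and the random row depends on the seed):

```r
n <- 9
k <- 3

block       <- rep(seq_len(k), each = n / k)     # consecutive blocks
interleaved <- rep(seq_len(k), length.out = n)   # venetian blinds

set.seed(1)
random <- sample(interleaved)                    # shuffle, folds stay balanced

rbind(case = seq_len(n), block, interleaved, random)
```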

IMHO the random scheme offers a crucial advantage: you can repeat the procedure. This is known as iterated or repeated $k$-fold cross validation. The iterations help you to reduce variance that is due to instability of the surrogate models (and to measure this instability), which is not possible with the first two schemes. So iterated $k$-fold CV is the best choice, and it implies random assignment, unless you have specific reasons for using one of the non-random schemes.

Note that if $k = n$, all 3 schemes are the same.

Cross validation always guarantees that each sample is tested exactly once during each iteration, and used exactly $k - 1$ times for training. If your splitting scheme doesn't have this property, it is not a cross validation. There are other splitting/resampling schemes for validation, such as hold-out/set validation (as opposed to 2-fold CV), out-of-bootstrap validation, etc.
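
This defining property is easy to check for repeated random $k$-fold CV. The sketch below (base R, illustrative names) counts how often each sample is tested and trained on within every iteration:

```r
set.seed(1)
n <- 9
k <- 3
iterations <- 5

tested  <- matrix(0L, iterations, n)  # times each sample was in a test fold
trained <- matrix(0L, iterations, n)  # times each sample was used for training

for (it in seq_len(iterations)) {
  fold <- sample(rep(seq_len(k), length.out = n))  # new random assignment
  for (f in seq_len(k)) {
    test <- fold == f
    tested[it, test]   <- tested[it, test] + 1L
    trained[it, !test] <- trained[it, !test] + 1L
  }
}

# Per iteration: tested exactly once, used for training exactly k - 1 times.
stopifnot(all(tested == 1L), all(trained == k - 1L))
```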

Solved – Caret – Repeated K-fold cross-validation vs Nested K-fold cross validation, repeated n-times

There's nothing wrong with the (nested) algorithm presented, and in fact, it would likely perform well with decent robustness for the bias-variance problem on different data sets. You never said, however, that the reader should assume the features you were using are the most "optimal", so if that's unknown, there are some feature selection issues that must first be addressed.

FEATURE/PARAMETER SELECTION

A less biased approach is to never let the classifier/model come close to anything remotely related to feature/parameter selection, since you don't want the fox (classifier, model) to guard the chickens (features, parameters). Your feature (parameter) selection method is a *wrapper*, where feature selection is bundled inside the iterative learning performed by the classifier/model. In contrast, I always use a feature *filter* that employs a different method, far removed from the classifier/model, in an attempt to minimize feature (parameter) selection bias. Look up wrapping vs. filtering and selection bias during feature selection (G.J. McLachlan).

There is always a major feature selection problem, for which the solution is to invoke a method of object partitioning (folds), in which the objects are partitioned into different sets. For example, simulate a data matrix with 100 rows and 100 columns, and then simulate a binary variate (0,1) in another column; call this the grouping variable. Next, run t-tests on each column using the binary (0,1) variable as the grouping variable. Several of the 100 t-tests will be significant by chance alone; however, as soon as you split the data matrix into two folds $\mathcal{D}_1$ and $\mathcal{D}_2$, each of which has $n=50$, the number of significant tests drops. Until you can solve this problem with your data by determining the optimal number of folds to use during parameter selection, your results may be suspect. So you'll need to establish some sort of bootstrap-bias method for evaluating predictive accuracy on the hold-out objects as a function of varying sample sizes used in each training fold, e.g., $\pi=0.1n, 0.2n, 0.3n, 0.4n, 0.5n$ (that is, increasing sample sizes used during learning), combined with a varying number of CV folds used, e.g., 2, 5, 10, etc.
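
The simulation described above can be sketched in base R. The reading used here is an assumption: a column only counts after the split if it is significant in both $\mathcal{D}_1$ and $\mathcal{D}_2$. The 0.05 threshold and the balanced 25/25 grouping within each fold are also choices made for this sketch.

```r
set.seed(123)
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)            # pure noise, no real group effect
g <- rep(rep(0:1, each = 25), times = 2)   # grouping variable, 25/25 per half

# p-value of a two-sample t-test for every column, using only the given rows.
pvals <- function(rows) {
  apply(X[rows, , drop = FALSE], 2, function(col) {
    t.test(col[g[rows] == 0], col[g[rows] == 1])$p.value
  })
}

p_all <- pvals(seq_len(n))   # all 100 objects
p_d1  <- pvals(1:50)         # fold D_1
p_d2  <- pvals(51:100)       # fold D_2

n_all  <- sum(p_all < 0.05)                  # significant by chance alone
n_both <- sum(p_d1 < 0.05 & p_d2 < 0.05)     # significant in both folds
```

Since the data are noise, roughly 5% of the columns are expected to pass any single test at this threshold, while passing in both folds is a much stricter requirement.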

OPTIMIZATION/MINIMIZATION

You seem to really be solving an optimization or minimization problem for function approximation, e.g. $y=f(x_1, x_2, \ldots, x_j)$, where a regression or other predictive model with parameters is used and $y$ is continuously scaled. Given this, and given the need to minimize bias in your predictions (selection bias, bias-variance, information leakage from testing objects into training objects, etc.), you might look into employing CV within swarm intelligence methods, such as particle swarm optimization (PSO), ant colony optimization, etc. PSO (see Kennedy & Eberhart, 1995) adds parameters for social and cultural information exchange among particles as they fly through the parameter space during learning. Once you become familiar with swarm intelligence methods, you'll see that you can overcome a lot of biases in parameter determination. Lastly, I don't know if there is a random forest (RF, see Breiman, 2001, in the journal Machine Learning) approach for function approximation, but if there is, use of RF for function approximation would alleviate 95% of the issues you are facing.
