Solved – Cross-validation when splitting data into train/dev/test sets


Background:

  • Train set: data used to train the chosen model
  • Dev set: data used to tune the model's hyperparameters
  • Test set: data used to evaluate the performance of the final model

How is cross-validation done when splitting the data into a train set, a dev set, and a test set instead of just train/test sets? I could not find any reference on this matter in the literature.

My intuition would be to perform a two-step cross-validation. For instance, if we want to do a 10-fold cross-validation, we would first do a basic 10-fold cross-validation to separate the train and test sets, and then split the train set into train and dev sets using a 9-fold (10-1) cross-validation, as in the sketch below. We would end up with 80% train, 10% dev, 10% test.
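As a concrete illustration, here is a minimal sketch of the two-step split using scikit-learn's KFold; the 100-sample array is an assumption made purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100)  # stand-in for 100 samples (illustrative assumption)

outer = KFold(n_splits=10)  # step 1: carve the test fold off from train+dev
for trainval_idx, test_idx in outer.split(X):
    inner = KFold(n_splits=9)  # step 2: split the remaining 90% into train/dev
    for train_pos, dev_pos in inner.split(trainval_idx):
        # the inner indices point into trainval_idx, so map them back to X
        train_idx, dev_idx = trainval_idx[train_pos], trainval_idx[dev_pos]
        # len(train_idx) == 80, len(dev_idx) == 10, len(test_idx) == 10
        ...
```

The nested loops also make the cost visible: 10 × 9 = 90 train/dev splits instead of 10.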

This method respects the generalization sought by the cross-validation methodology. However, the number of computations is (almost) squared, which is huge.

Another possibility would be to do a 5-fold (10/2) cross-validation to split the data into a train set and a dev+test set, and then split the dev+test set down the middle to recover the dev and test sets individually. We would also end up with 80% train, 10% dev, and 10% test.
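Sketched under the same illustrative assumptions as above:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100)  # stand-in for 100 samples (illustrative assumption)

# A 5-fold CV yields an 80/20 split; the 20% fold is halved into dev and test.
for train_idx, devtest_idx in KFold(n_splits=5).split(X):
    half = len(devtest_idx) // 2
    dev_idx, test_idx = devtest_idx[:half], devtest_idx[half:]
    # len(train_idx) == 80, len(dev_idx) == 10, len(test_idx) == 10
    ...
```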

What is your opinion on this?

Best Answer

I will just answer my own question, since I have found the answer I was looking for.

What I wanted to do is nested cross-validation. It consists of:

  • An inner loop that tunes the model using K-fold cross-validation for every combination of parameters (via grid search or random search)
  • An outer loop that evaluates the model using K'-fold cross-validation with the best combination of parameters found by the inner loop

So if I want to split my data into 80%/10%/10% (train/dev/test), I will first do a 10-fold cross-validation (outer loop: 90% train+dev and 10% test). Then, for every outer fold, I will do a 9-fold cross-validation to select the best parameters (inner loop: 80% train and 10% dev).
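Scikit-learn implements this pattern by wrapping a GridSearchCV estimator (the inner loop) in cross_val_score (the outer loop). A minimal sketch, assuming an SVC on the iris data with a small illustrative parameter grid (none of which appears in the question):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative model and grid; assumptions made for the sketch only.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

inner_cv = KFold(n_splits=9, shuffle=True, random_state=0)   # train/dev splits
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)  # (train+dev)/test splits

# Inner loop: grid search tunes the parameters on each outer training fold.
tuned = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: refit the tuned model on each training fold and score it
# on the corresponding held-out test fold.
scores = cross_val_score(tuned, X, y, cv=outer_cv)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```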

How does it compare to simple cross-validation in terms of the number of trained models?

  • CV: K models instead of 1
  • nested-CV: K * K' * number of parameter combinations
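For example, with K' = 10 outer folds, K = 9 inner folds, and (say) a grid of 20 parameter combinations, nested-CV trains 10 * 9 * 20 = 1800 models (plus the 10 refits of the best model in the outer loop), versus 10 for plain 10-fold CV.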

So nested-CV is much more expensive than basic CV, and its cost depends on the size of the parameter space (if doing grid search) and the number of folds in the inner loop. Smart restriction of the parameter space could speed up the computations quite a bit.
