Background:
- Train set: data used to train the chosen model
- Dev set: data used to tune the model's hyperparameters
- Test set: data used to evaluate the performance of the final model
How is cross-validation done when splitting the data into a train set, a dev set and a test set, instead of just train/test sets? I could not find any reference on this matter in the literature.
My intuition would be to perform a two-step cross-validation. For instance, for a 10-fold cross-validation, we would first do a basic 10-fold cross-validation to separate the train and test sets. Then we would split the train set into train and dev sets using a 9-fold (10-1) cross-validation. We end up with 80% train, 10% dev, 10% test.
This method respects the generalization that the cross-validation methodology is meant to provide. However, the number of trained models is (almost) squared, which is huge.
Another possibility would be to do a 5-fold (10/2) cross-validation to split the data into a train set and a dev+test set, then split the dev+test set in half to recover the dev and test sets individually. We also end up with 80% train, 10% dev and 10% test.
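The two-step split above can be sketched with scikit-learn's `KFold`; the 100-sample array and variable names here are just illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100)  # 100 dummy samples

# Step 1: outer 10-fold split -> 90 samples (train+dev), 10 test
outer = KFold(n_splits=10)
train_dev_idx, test_idx = next(outer.split(X))

# Step 2: 9-fold split of the 90% portion -> 80 train, 10 dev
# (inner indices are positions *within* train_dev_idx)
inner = KFold(n_splits=9)
train_pos, dev_pos = next(inner.split(train_dev_idx))

print(len(train_pos), len(dev_pos), len(test_idx))  # 80 10 10
```

Iterating over both splits instead of taking only the first fold gives the full 10 × 9 combinations mentioned below.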
What is your opinion on this ?
Best Answer
I will just answer my own question, since I have found the answer I was looking for.
What I wanted to do is a nested cross-validation. It is made of:
- an outer cross-validation loop, which estimates the generalization performance of the whole procedure on the test folds;
- an inner cross-validation loop, run within each outer training fold, which selects the best hyperparameters on the dev folds.
So if I want to split my data into 80%/10%/10% (train/dev/test), I will first do a 10-fold cross-validation (outer loop: 90% (train+dev) and 10% test). And for every fold, I will do a 9-fold cross-validation to select the best parameters (inner loop: 80% train and 10% dev).
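A minimal sketch of this nested setup with scikit-learn, where `GridSearchCV` plays the inner loop and `cross_val_score` the outer loop; the iris dataset, the SVM and the `C` grid are illustrative choices, not part of the original question:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 9-fold CV selects the best hyperparameters
# on the train/dev portion of each outer fold.
param_grid = {"C": [0.1, 1, 10]}
inner_cv = KFold(n_splits=9, shuffle=True, random_state=0)
model = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: 10-fold CV estimates the generalization
# performance of the whole selection procedure.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=outer_cv)
print(scores.mean())
```

The key point is that the test fold of the outer loop is never seen by the inner grid search, so the outer score is an unbiased estimate of the tuned model's performance.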
How does it compare to a simple cross-validation in terms of number of trained models ?
So nested CV is much more expensive than basic CV, and its cost depends on the size of the parameter space (if doing grid search) and on the number of folds of the inner loop. Smart restriction of the parameter space can speed up the computations quite a bit.
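As a rough count, assuming grid search in the inner loop and one refit per outer fold (the fold counts and grid size below are just example figures):

```python
def n_models_simple_cv(k, grid_size):
    # plain k-fold CV with grid search: one fit per fold per candidate
    return k * grid_size

def n_models_nested_cv(k_outer, k_inner, grid_size):
    # each outer fold runs a full inner grid search, plus one refit
    # on the outer training set with the selected hyperparameters
    return k_outer * (k_inner * grid_size + 1)

print(n_models_simple_cv(10, 20))      # 200
print(n_models_nested_cv(10, 9, 20))   # 10 * (9*20 + 1) = 1810
```

With these figures, nested CV trains roughly `k_inner` times as many models as simple CV with the same grid.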
Additional information on nested-CV: