Solved – How does k-fold cross validation fit in the context of training/validation/testing sets

Tags: cross-validation, dataset, overfitting

My main question is about understanding how k-fold cross-validation fits in the context of having training/validation/testing sets (if it fits into such a context at all).

Usually, people speak of splitting the data into a training, validation and testing set – say at a ratio of 60/20/20 per Andrew Ng's course – whereby the validation set is used to identify optimal parameters for model training.
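For concreteness, the kind of split I have in mind looks roughly like this (a minimal sketch using scikit-learn; the data and variable names are just placeholders for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data, purely for illustration.
X, y = make_classification(n_samples=1000, random_state=42)

# First carve off 20% of the data as the test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
# ... then split the remaining 80% into 60% train / 20% validation
# (0.25 of the remaining 80% equals 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)
```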

However, if one wanted to use k-fold cross-validation in the hope of obtaining a more representative accuracy measure when the amount of data is relatively small, what exactly would doing k-fold cross-validation entail in this 60/20/20 split scenario?

For instance, would that mean that we'd actually combine the training and testing sets (80% of the data) and do k-fold cross-validation on them to obtain our accuracy measure (effectively doing away with an explicit 'testing set')? If so, which trained model do we then use a) in production, and b) against the validation set to identify the optimal training parameters?
One possible answer for a) and b) might be to use the model from the best-performing fold.

Best Answer

Cross-validation usually removes the need for a separate validation set.

The basic idea with training/validation/test data sets is as follows:

  1. Training: You try out different types of models with different choices of hyperparameters on the training data (e.g. a linear model with different selections of features, a neural net with different choices of layers, a random forest with different values of mtry).

  2. Validation: You compare the performance of the models from Step 1 on the validation set and select the winner. This helps to avoid wrong decisions caused by overfitting the training data set.

  3. Test: You try out the winning model on the test data just to get a feeling for how well it performs in reality. This reveals any overfitting introduced in Step 2. Here, you would not make any further decisions; it is just plain information. (A sketch of this workflow follows the list.)
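A minimal sketch of these three steps with an explicit validation set, assuming scikit-learn; the candidate models, the 60/20/20 split and accuracy as the metric are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data and a 60/20/20 split, as in the question.
X, y = make_classification(n_samples=1000, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Step 1: fit a few candidate models on the training set only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest_small": RandomForestClassifier(n_estimators=50, random_state=0),
    "random_forest_large": RandomForestClassifier(n_estimators=500, random_state=0),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Step 2: compare the candidates on the validation set.
    val_scores[name] = model.score(X_val, y_val)

best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# Step 3: a single test evaluation of the winner, reported as plain information.
print(best_name, "test accuracy:", best_model.score(X_test, y_test))
```

The important point is that the test score at the end is reported once and does not feed back into any further decision.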

Now, if you replace the validation step by cross-validation, the data is handled almost identically, but you only need a training and a test data set. There is no need for a separate validation data set.

  1. Training: See above.

  2. Validation: You do cross-validation on the training data to choose the best model of Step 1 with respect to cross-validation performance (here, the original training data is repeatedly split into a temporary training and validation set). The models fitted within cross-validation are used only for choosing among the candidates of Step 1; the candidates themselves are all fitted on the full training set (see the sketch after this list).

  3. Test: See above.
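And a minimal sketch of the cross-validation variant, again assuming scikit-learn and illustrative candidate models: k-fold scores on the training data pick the winner, the temporary fold models are thrown away, and the winner is refit on the full training set before the single test evaluation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data and a simple 80/20 train/test split; no validation set.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Step 2 (validation via CV): mean 5-fold score on the training data only.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# The fold models are discarded; the winner is refit on the full training set.
best_model = candidates[best_name].fit(X_train, y_train)

# Step 3 (test): one final performance estimate, purely as information.
print(best_name, "test accuracy:", best_model.score(X_test, y_test))
```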