Solved – In k-fold cross-validation, does the training subsample include the test set?

cross-validation

On this Wikipedia page, the subsection on k-fold cross-validation says: "In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data". The test data is not in the picture at all.

Whereas in a book I read, the author clearly indicates:

  1. The full data is divided into three sets: a training set, a test set, and a validation set (or "subsamples" in Wikipedia's language).
  2. Of the k subsamples, one subsample is retained as the validation data, one other subsample is retained as the test data, and the remaining k − 2 subsamples are used as training data.

Which is correct?

Best Answer

They are both correct in their own context. They describe two different procedures, used in different situations.

In general, when you are doing both model selection and testing, your data is divided into three parts: a training set, a validation set, and a testing set. You use the training set to train different models, estimate their performance on the validation set, select the model with the best validation performance, and finally test that model on the testing set.
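A minimal sketch of this three-way split, assuming scikit-learn; the synthetic dataset, candidate models, and split sizes are illustrative placeholders, not part of the original answer:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Carve off the test set first, then split the rest into train/validation
# (a 60/20/20 split here; the proportions are arbitrary).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Hypothetical candidate models.
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)]

# Train each candidate on the training set, pick by validation score.
best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_val, y_val))

# Only the selected model ever touches the test set.
print("test accuracy:", best.score(X_test, y_test))
```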

On the other hand, if you are using K-fold cross-validation to estimate the performance of a single model, your data is divided into K folds. You loop through the K folds, each time using one fold as the testing (or validation) set and the remaining K − 1 folds as the training set. You then average across all folds to get an estimate of your model's testing performance. This is what the Wikipedia page is referring to.
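A sketch of that plain K-fold loop, again assuming scikit-learn; the model and K = 5 are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # K-1 folds for training
    scores.append(model.score(X[test_idx], y[test_idx]))  # held-out fold for testing

print("estimated performance:", np.mean(scores))          # average across folds
```

In practice `cross_val_score(model, X, y, cv=5)` from `sklearn.model_selection` does the same loop in one call.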

But keep in mind that this estimates the performance of one specific model. If you have multiple candidate models and want to do model selection as well, you have to select the model using only your training data; if the testing fold influenced which model you picked, your performance estimate on that same fold would be optimistically biased, which is the subtle circular-logic fallacy to avoid. So you further divide the K − 1 'training' folds into two parts, one for training and one for validation. This means you run an extra 'cross-validation' within the K − 1 folds to select the optimal model, and then test that optimal model on the held-out testing fold. In other words, you are doing a two-level cross-validation: the outer K-fold cross-validation, and within each outer loop an extra (K − 1)-fold cross-validation for model selection. Then you have what you stated in your question: 'Of the k subsamples one subsample is retained as the validation data, one other subsample is retained as the test data, and k-2 subsamples are used as training data.'
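A sketch of that two-level (nested) scheme under the same assumptions as above: the inner cross-validation selects a model using only the outer training folds, and the outer fold is used purely for testing. The candidate models and fold counts are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)]

outer_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Inner CV on the outer training folds only, for model selection;
    # the outer test fold never influences which model is chosen.
    inner_scores = [cross_val_score(m, X_tr, y_tr, cv=4).mean() for m in candidates]
    best = clone(candidates[int(np.argmax(inner_scores))]).fit(X_tr, y_tr)

    # Test the selected model on the held-out outer fold.
    outer_scores.append(best.score(X[test_idx], y[test_idx]))

print("unbiased performance estimate:", np.mean(outer_scores))
```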