Cross-Validation – How to Use the ‘Test’ Dataset After Cross-Validation

cross-validation, machine-learning, validation

In some lectures and tutorials I've seen, they suggest splitting your data into three parts: training, validation and test. But it is not clear how the test dataset should be used, or how this approach is better than cross-validation over the whole dataset.

Let's say we have saved 20% of our data as a test set. Then we take the rest, split it into k folds and, using cross-validation, we find the model that makes the best prediction on unknown data from this dataset. Let's say the best model we have found gives us 75% accuracy.
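To make the setup concrete, here is roughly what I mean, sketched with scikit-learn on a synthetic dataset (the library, the data and the two candidate models are just illustrative assumptions): 20% is held back as a test set, and 5-fold cross-validation on the remaining 80% is used to compare models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, only so the example is runnable.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold back 20% as the test set; it is not touched during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Compare candidate models with k-fold cross-validation on the remaining 80%.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```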

Various tutorials and lots of questions on various Q&A websites say that we can now verify our model on the saved (test) dataset. But I still don't understand how exactly this is done, or what the point of it is.

Let's say we get an accuracy of 70% on the test dataset. So what do we do next? Do we try another model, and then another, until we get a high score on our test dataset? In that case it looks like we will just find the model that fits our limited (only 20%) test set. That doesn't mean we will have found the model that is best in general.

Moreover, how can we consider this score a general evaluation of the model if it is calculated only on a limited dataset? If the score is low, maybe we were just unlucky and selected "bad" test data.

On the other hand, if we use all the data we have and then choose the model using k-fold cross-validation, we will find the model that makes the best prediction on unknown data from the entire data set we have.

Best Answer

This is similar to another question I answered regarding cross-validation and test sets. The key concept to understand here is independent datasets. Consider just two scenarios:

  1. If you have lots of resources, you would ideally collect one dataset and train your model via cross-validation. Then you would collect another, completely independent dataset and test your model on it. However, as I said previously, this is usually not possible for many researchers.

Now, if I am a researcher who isn't so fortunate, what do I do? Well, you can try to mimic that exact scenario:

  2. Before you do any model training, you would take a split of your data and leave it to the side (never to be touched during cross-validation). This is to simulate that very same independent dataset mentioned in the ideal scenario above. Even though it comes from the same dataset, the model training won't take any information from those samples (whereas with cross-validation all the data is used). Once you have trained your model, you would then apply it to your test set, which, again, was never seen during training, and get your results. This is done to make sure your model is more generalizable and hasn't just learned your data (a small sketch of this workflow is given below).
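
Here is a minimal sketch of that workflow, assuming scikit-learn and a synthetic dataset (both are just for illustration, as is the choice of model): the test split is made before any training, the model is fit only on the training portion, and the held-out split is scored exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The test split is made up front and set aside.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Suppose cross-validation on X_train picked this model (illustrative choice).
best_model = RandomForestClassifier(random_state=0)

# Fit the chosen model on the training portion only...
best_model.fit(X_train, y_train)

# ...then score it once on the test set it has never seen. This number is the
# estimate of performance on unseen data.
test_accuracy = best_model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```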

To address your other concerns:

Let's say we get an accuracy of 70% on the test dataset. So what do we do next? Do we try another model, and then another, until we get a high score on our test dataset?

Sort of. The idea is that you are creating the best model you can from your data and then evaluating it on some more data it has never seen before. You can re-evaluate your cross-validation scheme, but once you have a tuned model (i.e. tuned hyperparameters) you move forward with that model, because it was the best you could make. The key is to NEVER USE YOUR TEST DATA FOR TUNING. Your result on the test data is your model's performance on 'general' data. Repeating this process against the test set would remove the independence of the datasets (which was the entire point). This is also addressed in another question on test/validation data.
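To make the "tune with cross-validation, then touch the test set once" point concrete, here is a hedged sketch (again assuming scikit-learn; the estimator and grid values are arbitrary): all hyperparameter tuning happens inside cross-validation on the training portion, and the test set only provides the single final number.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter tuning uses cross-validation on the training data only.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("best CV accuracy:", round(search.best_score_, 3))
print("best hyperparameters:", search.best_params_)

# The test set is used exactly once, for the final report. If the result is
# disappointing, resist the temptation to go back and re-tune against it.
print("test accuracy:", round(search.score(X_test, y_test), 3))
```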

And also, how can we consider this score a general evaluation of the model if it is calculated on a limited dataset? If this score is low, maybe we were unlucky and selected "bad" test data.

This is unlikely if you have split your data correctly. You should be splitting your data randomly (although potentially stratified for class balancing). If your dataset is large enough that you are splitting it into three parts, your test subset should be large enough that the chance of simply having chosen bad data is very low. It is more likely that your model has been overfit.
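For the splitting itself, a small sketch (scikit-learn assumed, with a deliberately imbalanced synthetic dataset): a random split stratified on the labels keeps the class balance of the test set close to that of the whole dataset, which makes an unlucky "bad" test split even less likely.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset (roughly 90% / 10%) so the effect is visible.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Random split, stratified on the labels so both parts keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("class ratio overall:", np.bincount(y) / len(y))
print("class ratio in test:", np.bincount(y_test) / len(y_test))
```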
