Let's say I want to compare two machine learning models (A and B) on a classification problem. I split my data into a training set (80%) and a test set (20%). Then I perform 4-fold cross-validation on the training set (so each time my validation fold holds 20% of the full data).
The average cross-validation accuracy over the folds is:
model A – 80%
model B – 90%
Finally, I test the models on the test set and get the accuracies:
model A – 90%
model B – 80%
Which model would you choose?
The test result is more representative of the generalization ability of the model because the test set was never used during training. However, the cross-validation result is arguably more representative because it reflects performance measured across 80% of the data instead of just the 20% held out for testing. Moreover, if I change the random split, the test accuracies vary considerably, whereas the average cross-validation accuracy stays more stable.
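For concreteness, the setup described above can be sketched with scikit-learn on synthetic data (the dataset, the two model classes, and all scores here are hypothetical placeholders, not the asker's actual numbers):

```python
# Sketch of the described experiment: 80/20 split, then 4-fold CV on the
# training set, then one test-set evaluation per model. Hypothetical data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 80 % train / 20 % test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {"A": LogisticRegression(max_iter=1000),
          "B": DecisionTreeClassifier(random_state=0)}

for name, model in models.items():
    # 4-fold CV on the training set: each validation fold is 20 % of all data
    cv_acc = cross_val_score(model, X_train, y_train, cv=4).mean()
    # single evaluation on the held-out test set
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"model {name}: CV accuracy {cv_acc:.2f}, test accuracy {test_acc:.2f}")
```

With real data the two numbers can disagree exactly as in the question, which is what motivates the answer below.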
Best Answer
First of all, if the cross-validation results are actually not used to decide anything (no parameter tuning, no model selection, nothing), then you gain nothing from the test set you describe.
That being said, selecting a model is part of training the final model. Thus, the selected model needs to undergo independent validation.
In your case, this means: select according to your cross-validation results, e.g. model B (although you may want to look into more sophisticated selection rules that take instability into account). Then do an independent test of the selected model. That result is your validation (or better: verification) result for the final model. Here: 80%.
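A minimal sketch of this select-then-verify procedure, again on hypothetical data with placeholder candidate models (the simple "highest CV mean" rule stands in for whatever selection rule you actually use):

```python
# Selection is part of training: compare CV means on the training set only,
# then report one independent test score for the selected model alone.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {"A": LogisticRegression(max_iter=1000),
              "B": DecisionTreeClassifier(random_state=0)}

# Selection step (uses only the training set, via 4-fold CV)
cv_means = {name: cross_val_score(m, X_tr, y_tr, cv=4).mean()
            for name, m in candidates.items()}
best = max(cv_means, key=cv_means.get)

# Verification step: one independent test of the selected model only.
# This number is the verification result for the final model.
final_score = candidates[best].fit(X_tr, y_tr).score(X_te, y_te)
print(f"selected model {best}, verification accuracy {final_score:.2f}")
```

Note that the test score of the model you did *not* select plays no role; it was only measured for comparison in the question.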
However, you can use an additional outer cross-validation loop for that, avoiding the difficulty of having only a few test cases for the final verification.
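That nested (outer + inner) cross-validation can be sketched as follows; the selection step is repeated inside every outer fold, so each outer test fold verifies an independently selected model (again a sketch on synthetic data, with the same placeholder candidates as above):

```python
# Nested CV: an outer loop for verification, an inner 4-fold CV for selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)]

outer = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []
for train_idx, test_idx in outer.split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # inner 4-fold CV on the outer-training portion selects the model
    best = max(candidates,
               key=lambda m: cross_val_score(m, X_tr, y_tr, cv=4).mean())
    # the held-out outer fold verifies the selected model
    outer_scores.append(best.fit(X_tr, y_tr).score(X[test_idx], y[test_idx]))

print(f"nested-CV verification accuracy: {np.mean(outer_scores):.2f}")
```

The average over the outer folds estimates the performance of the whole procedure (selection included), rather than of one fixed model, which is why it sidesteps the small-test-set problem.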