Solved – Should I use the cross validation score or the test score to evaluate a machine learning model

cross-validation, model-evaluation

Let's say I want to compare two machine learning models (A and B) on a classification problem. I split my data into train (80%) and test set (20%). Then I perform 4-fold cross-validation on the training set (so every time my validation set has 20% of the data).
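To make the setup concrete, here is a minimal sketch of the procedure with scikit-learn; the synthetic data and the two classifiers are placeholders, not my actual data or models A and B:

```python
# Illustrative only: synthetic data and stand-in models "A" and "B"
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 80% train / 20% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {"A": LogisticRegression(max_iter=1000),
          "B": RandomForestClassifier(random_state=0)}

for name, model in models.items():
    # 4-fold CV on the training set: each validation fold is 20% of all data
    cv_acc = cross_val_score(model, X_train, y_train, cv=4).mean()
    # single evaluation on the held-out 20% test set
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"model {name}: CV accuracy {cv_acc:.2f}, test accuracy {test_acc:.2f}")
```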

The average cross-validation accuracy I get over the folds is:

model A – 80%

model B – 90%

Finally, I test the models on the test set and get the accuracies:

model A – 90%

model B – 80%

Which model would you choose?

The test result is more representative of the model's generalization ability because the test set is never used during training. However, the cross-validation result is more representative in another sense: it reflects the performance of the system across the 80% of the data in the training set rather than on just the 20% held out for testing. Moreover, if I change how I split my sets, the test accuracies I get vary a lot, whereas the average cross-validation accuracy is more stable.

Best Answer

First of all, if the cross-validation results are actually not used to decide anything (no parameter tuning, no selection, nothing), then you don't gain anything from the test set you describe:

  • your split into training/test is subject to the same difficulties as your subsequent splitting of the training set into surrogate training and surrogate (cross-validation) test sets. Any data leakage (e.g. due to confounders you did not account for) happens to both.
  • in addition, as you say, the 20 % test set is smaller. Whether this is a problem or not depends largely on the absolute number of cases you have. If 20 % of your data are sufficient to yield test results with a suitable precision for your application at hand, then you are fine.

That being said, selecting a model is part of the training of the final model. Thus, the selected model needs to undergo independent validation.

In your case, this means: select according to your cross validation, e.g. model B (although you may want to look into more sophisticated selection rules that take instability into account). Then do an independent test of the selected model. That result is your validation (or better: verification) result for the final model. Here: 80 %.
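A rough sketch of that procedure, again assuming scikit-learn; the candidate estimators are stand-ins for A and B, and the simple "highest mean CV accuracy" rule is used for selection:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

candidates = {"A": LogisticRegression(max_iter=1000),
              "B": RandomForestClassifier(random_state=0)}

# Selection step: 4-fold CV on the training data only; the spread across
# folds gives a first impression of instability.
cv_scores = {name: cross_val_score(est, X_train, y_train, cv=4)
             for name, est in candidates.items()}
for name, scores in cv_scores.items():
    print(f"model {name}: CV accuracy {scores.mean():.2f} +/- {scores.std():.2f}")
best = max(cv_scores, key=lambda name: cv_scores[name].mean())

# Verification step: one test-set evaluation of the selected model only
final_model = candidates[best].fit(X_train, y_train)
print(f"selected model {best}: test accuracy {final_model.score(X_test, y_test):.2f}")
```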
However, you can use an additional outer cross-validation for that final verification, which avoids the difficulty of having only a few test cases.
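A sketch of that nested ("outer") cross-validation, still assuming scikit-learn: the inner CV selects between the placeholder candidates, and the outer CV estimates the performance of the whole select-then-fit procedure rather than of one fixed split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, random_state=0)

# Inner loop: choose between stand-in models A and B by 4-fold CV
select_by_cv = GridSearchCV(
    estimator=Pipeline([("clf", LogisticRegression(max_iter=1000))]),
    param_grid={"clf": [LogisticRegression(max_iter=1000),
                        RandomForestClassifier(random_state=0)]},
    cv=4,
)

# Outer loop: each outer fold repeats the selection and tests the winner,
# so the resulting score describes the whole procedure
outer_scores = cross_val_score(select_by_cv, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```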