Solved – the purpose of cross-validation

Tags: cross-validation, machine learning, optimization

In this post on Stack Exchange, the answer states that "The purpose of cross-validation is model checking, not model building." A very good explanation is given as follows: "(…) selecting one of the surrogate models means selecting a subset of training samples and claiming that this subset of training samples leads to a superior model."

While this is intuitive, usually

  1. we pick the best-performing classifier from the cross-validation and test it on a further set – usually called the test set. So we do in fact use cross-validation for building a model by picking a particular one; often, in combination with hyperparameter optimization, we would choose the specific set of hyperparameters that led to the best validation-set results. Does this not contradict the above statement?

  2. if the results of the cross-validation depend on the particular validation set, why is model selection justified this way?

Best Answer

we pick the best-performing classifier from the cross-validation and test it on a further set – usually called the test set. So we do in fact use cross-validation for building a model by picking a particular one; often, in combination with hyperparameter optimization, we would choose the specific set of hyperparameters that led to the best validation-set results. Does this not contradict the above statement?

The situation may be more easily explained if you divide it into a different set of building blocks.

  1. There are techniques to measure model performance, e.g. cross-validation, testing on a single held-out split of your data, auto-prediction (resubstitution), or performing a fully blown validation study.
    They differ with respect to efficiency of data use, systematic (bias) and random (variance) uncertainty, cost/effort, etc.
    But technically you can use any of these for the building block "estimate performance" (see the sketch after this list).

  2. Optimization, here: choosing a good model from a variety of possible models. There are many criteria for what makes a good model, but one criterion that is widely applicable is predictive performance.
    So if you choose to use this optimization criterion, you then need to choose a suitable way of measuring/estimating the performance of the (surrogate) models you are considering. Doing a full validation study for lots of models isn't feasible; auto-prediction doesn't yield enough information; and, depending on your data, a single held-out split may be subject to too much random uncertainty. So in the end you settle for cross-validation.
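To make building block 1 concrete, here is a minimal sketch assuming scikit-learn; the dataset (breast cancer) and classifier (a scaled logistic regression) are placeholders chosen only for illustration, not prescribed by the answer. It fills the "estimate performance" slot with three of the techniques named above:

```python
# Sketch: three interchangeable ways to fill the "estimate performance" building block.
# Dataset and classifier are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# (a) auto-prediction (resubstitution): cheapest, but optimistically biased
resubstitution = clf.fit(X, y).score(X, y)

# (b) a single held-out split: less biased, but one split carries high random uncertainty
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout = clf.fit(X_tr, y_tr).score(X_te, y_te)

# (c) cross-validation: every sample is tested exactly once, averaging down the variance
cv_scores = cross_val_score(clf, X, y, cv=5)

print(f"resubstitution:   {resubstitution:.3f}")
print(f"single held-out:  {holdout:.3f}")
print(f"5-fold CV:        {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```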

So cross-validation is used in model optimization just as a light switch is used in a car: you need something to make the light go on and off as needed, and you use a solution that already (and primarily) exists outside cars. But thinking of light switches primarily as "something used inside cars" likely doesn't help much in understanding how they work and what their specific characteristics are. They exist independently of cars, but can be applied inside cars – just as they can be applied inside houses, other machines, etc.
Similarly, cross-validation exists as a validation (actually: verification) technique, and this technique can be applied for calculating the target functional of your optimization. Or for verification of the optimized model (see below). Or ...
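As an illustration of "cross-validation as the target functional of the optimization", here is a sketch using scikit-learn's GridSearchCV; the SVM pipeline and the parameter grid are assumptions chosen only for illustration:

```python
# Sketch: cross-validation used as the target functional inside an optimizer.
# GridSearchCV computes the same CV estimate for every candidate parameter set
# and keeps the maximum; model and grid below are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    cv=5,  # this "inner" cross-validation is the optimization criterion
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that `search.best_score_` is exactly the pick-the-maximum value the next building block warns about: once it has been used for selection, it is itself an optimistically biased estimate.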

  3. Final Verification: (not really a new building block, but nevertheless necessary) Because we know that variance uncertainty in the target functional used for pick-the-maximum optimization tends to lead to overfitting, we get another, independent measurement of the performance of the model we decided on. Again, we have the same choice of methods as in building block 1. I may choose to do another cross-validation there (aka outer cross-validation, to distinguish it from the "inner" cross-validation inside the optimization; see the sketch below) – you may choose to go for the "further test set".
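A minimal sketch of such an outer ("nested") cross-validation, again assuming scikit-learn and the same illustrative model and grid as above:

```python
# Sketch: final verification via an "outer" cross-validation wrapped around the
# hyperparameter search, so the selected model is measured on data the
# optimization never saw. Model and grid are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    cv=5,                                   # inner CV: the optimization criterion
)
outer_scores = cross_val_score(inner, X, y, cv=5)   # outer CV: the verification
print(f"nested CV estimate: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```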

if the results of the cross-validation depend on the particular validation set, why is model selection justified this way?

You should not base your decision on the result obtained on a single validation set: neither on a single cross-validation surrogate nor on a single held-out split of your data. Instead, you should judge those results taking into account their uncertainty.
If you suspect that the surrogate models are unstable, i.e. that the actual splitting has an influence on the result, you should calculate more splits and check this directly, e.g. by repeated/iterated cross-validation (see the sketch below).
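One way to run such a check, sketched here with scikit-learn's RepeatedStratifiedKFold (dataset and classifier are again just placeholders): the spread of the mean score across repetitions indicates how much the particular splitting influences the result.

```python
# Sketch: repeated (iterated) cross-validation to check the influence of the splitting.
# Each repetition reshuffles the data before splitting; the spread of the
# per-repetition means reflects instability caused by the particular splits.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)        # 10 repetitions x 5 folds = 50 scores

per_repeat = scores.reshape(10, 5).mean(axis=1)   # mean accuracy of each repetition
print(f"overall estimate:          {scores.mean():.3f}")
print(f"spread across repetitions: {per_repeat.std():.3f}")
```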

Keep in mind: The fact that many people happily overfit their models doesn't mean that this is good practice...