Solved – How is cross validation different from data snooping

Tags: cross-validation, machine-learning

I just finished "An Introduction to Statistical Learning", and I wondered whether using cross-validation to find the best tuning parameters for various machine learning techniques is different from data snooping.

We are repeatedly checking which value of the tuning parameter yields the best predictive performance on the test set. What if the tuning parameter we arrive at just happens to fit this particular test set by chance, and won't perform well on some future test set?

Please excuse my novice understanding of machine learning; I'm eager to be educated.

EDIT: Please see @AdamO's answer for the definition of "data snooping". I used the term very inaccurately in my question.

Best Answer

I wondered whether using cross-validation to find the best tuning parameters for various machine learning techniques is different from data snooping.

Your concern is spot on, and there is a whole lot of literature on this topic, e.g. Cawley, G. C. & Talbot, N. L. C.: "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11, 2079–2107 (2010).

The problem is that hyperparameter tuning with cross validation is a data-driven optimization process, and it will still tend to overfit to your data set (less so than tuning by resubstitution error, but still). Trying to use the tuning cross validation results as an "independent" performance measure is in a way like eating the pie (= tuning) and keeping it (= measuring final model performance).

This does not mean that you shouldn't use cross validation for hyperparameter tuning. It just means that any one cross validation can serve only one purpose: either optimization or measuring model performance for validation purposes.

The solution is that you need an independent validation to measure the quality of the model obtained with the tuned hyperparameters. This is called nested or double validation. You'll find a number of questions and answers on these topics here.
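
As a rough illustration, here is a minimal sketch of nested validation using scikit-learn; the estimator, parameter grid, and data set are placeholders, not recommendations. The inner loop tunes the hyperparameter, while the outer loop measures the performance of the whole tuning-plus-fitting procedure on cases the tuning never saw.

```python
# Minimal sketch of nested (double) cross validation with scikit-learn.
# Estimator, grid, and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: data-driven tuning of the hyperparameter C.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer loop: independent estimate of the performance of the *whole*
# procedure (tuning + fitting), on data the tuning never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```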

Conceptually, I like to say that training includes all kinds of fancy steps to fit not only the "usual" model parameters but also to fit (auto-tune) the hyperparameters. So data-driven optimization of λ is clearly part of the model training.

As a rule of thumb you can also say that model training is everything that needs to be done before you have a ready-to-use final black-box function that is able to produce predictions for new cases.
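
In code terms, a minimal sketch of that rule of thumb (again with placeholder estimator, grid, and data): everything inside the single fit call, including the auto-tuning of C, is training, and what comes out is the black box.

```python
# "Training" is everything up to a ready-to-use black box.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=200, random_state=0)

# Fitting the "usual" parameters *and* auto-tuning the hyperparameter C
# happen in one single training step.
black_box = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_train, y_train)

# The finished model is nothing but a prediction function for new cases.
predictions = black_box.predict(X_train[:5])
```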


PS: I find the testing vs. validation terminology very confusing because in my field "validation" means proving that the final model is fit for purpose, and is therefore what other people call testing rather than validation. I prefer to call the inner test set "tuning test set" and the outer "final validation test set" or the like.


Update:

So if my model (i.e. my tuning parameter in this case) fails the outer validation, what should I do then?

Typically, this is not something that just happens: there are typical situations that can cause such a failure, and all the situations I'm aware of are overfitting situations. You need to be aware that while regularization helps to reduce the necessary number of training cases, data-driven optimization needs large amounts of data.

My recommendations:

  • Typically, you (should) already have rough expectations, e.g. what performance should be achievable, or what performance you'd consider suspiciously good. Alternatively, you have specs for the performance you need to achieve and a baseline performance. From that and the number of available training cases (for the splitting scheme you decided on), calculate the expected uncertainty for the inner (tuning) tests (see the first sketch after this list). If that uncertainty indicates that you would not be able to get meaningful comparisons, don't do data-driven optimization.

  • You should check how stable both the predictions obtained with the chosen λ and the optimal λ found by the auto-tuning procedure are (see the second sketch after this list). If λ isn't reasonably stable with respect to different splits of your data, the optimization didn't work.

  • If you find either that you won't be able to do the data-driven optimization or that it didn't work after all, you may choose λ by your expert knowledge, e.g. from experience with similar data. Or use the knowledge that, if the optimization failed, you'll need stronger regularization: the overfitting that leads to the failure works towards overly complex models.
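
For the first point, a back-of-the-envelope sketch, assuming a classification setting where performance is a proportion of correct predictions and using a normal approximation to the binomial (the function name is mine, for illustration only):

```python
# Rough uncertainty of an accuracy estimate from n tuning-test cases,
# using a normal approximation to the binomial (an assumption).
import math

def accuracy_std_error(accuracy: float, n_test_cases: int) -> float:
    """Standard error of an observed proportion of correct predictions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test_cases)

# Example: with 50 inner-test cases and ~80 % accuracy, the standard
# error is ~5.7 percentage points, which is too coarse to compare
# candidate hyperparameters whose true performance differs by 1-2 points.
print(accuracy_std_error(0.80, 50))   # ~0.057
```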
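For the second point, a minimal sketch of such a stability check (placeholder estimator, grid, and data again): rerun the auto-tuning on many different splits and look at the spread of the selected hyperparameter.

```python
# Stability check: repeat the data-driven tuning on different splits
# and inspect the spread of the selected hyperparameter.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

selected = []
for seed in range(20):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv).fit(X, y)
    selected.append(search.best_params_["C"])

# If the chosen C jumps around between splits, the optimization did not
# find a stable optimum and its result should not be trusted.
print(Counter(selected))
```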