Testing Data – How to Ensure Testing Data Does Not Leak into Training Data

classification, cross-validation, machine-learning, out-of-sample, predictive-models

Suppose we have someone building a predictive model, but that someone is not necessarily well-versed in proper statistical or machine learning principles. Maybe we are helping that person as they are learning, or maybe that person is using some sort of software package that requires minimal knowledge to use.

Now this person might very well recognize that the real test comes from accuracy (or whatever other metric) on out-of-sample data. However, my concern is that there are a lot of subtleties there to worry about. In the simple case, they build their model on training data and evaluate it on held-out testing data. Unfortunately, it can be all too easy at that point to go back, tweak some modeling parameter, and check the results on that same "testing" data. At that point the data is no longer truly out-of-sample, and overfitting can become a problem.
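To make the setup concrete, here is a minimal sketch of that basic workflow in Python with scikit-learn. The dataset, model choice, and names (`X`, `y`, `model`) are illustrative assumptions, not part of the question:

```python
# Minimal sketch of the "train once, evaluate on a held-out test set" workflow.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data; in practice X and y come from the user's own problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One split, made before any modelling is done.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The danger described above: if we tweak the model and re-run this line
# repeatedly, the "test" score quietly becomes a model-selection score.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```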

One potential way to resolve this problem would be to create many out-of-sample datasets, so that each testing dataset can be discarded after a single use and never reused. This requires a lot of data management, though, especially since the splitting has to be done before the analysis begins (so you would need to know in advance how many splits you will need).
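One way to operationalise that idea is to carve the reserved data into several disjoint, single-use chunks up front. The sketch below is just one possible scheme (the chunk count, the 40% reserve, and the reuse of `X`, `y`, `model` from the previous sketch are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reserve 40% of the rows for testing and split that reserve into
# 4 disjoint, single-use test sets, decided before any modelling.
n = len(y)
perm = rng.permutation(n)
n_reserved = int(0.4 * n)
train_idx, reserved_idx = perm[n_reserved:], perm[:n_reserved]

single_use_test_sets = list(np.array_split(reserved_idx, 4))

# Each time a model iteration needs an honest evaluation, pop one chunk
# and never reuse it; an empty list means the evaluation budget is spent.
test_idx = single_use_test_sets.pop()
# score = accuracy_score(y[test_idx], model.predict(X[test_idx]))
```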

Perhaps a more conventional approach is k-fold cross-validation. However, in some sense that loses the distinction between a "training" and a "testing" dataset, which I think can be useful, especially to those still learning. Also, I'm not convinced it makes sense for all types of predictive models.
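For reference, k-fold cross-validation itself is only a few lines in scikit-learn (again a sketch; the model is a placeholder and `X`, `y` are assumed from the earlier snippet):

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold CV: every observation is used for testing exactly once, but the
# clean "one training set / one testing set" picture is lost.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("mean CV accuracy:", scores.mean(), "+/-", scores.std())
```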

Is there some approach I've overlooked that would help overcome the problem of overfitting and test-set leakage while still remaining reasonably clear to an inexperienced user?

Best Answer

You are right: this is a significant problem in machine learning/statistical modelling. Essentially the only way to really solve it is to retain an independent test set, keep it held out until the study is complete, and use it only for final validation.

Inevitably, however, people will look at the results on the test set and then change their model accordingly; yet this won't necessarily result in an improvement in generalisation performance, as the difference in performance between models may be largely due to the particular sample of test data that we have. In making such a choice we are effectively over-fitting the test error.

The way to limit this is to make the variance of the test error as small as possible (i.e. the variability in the test error we would see if we used different samples of data, drawn from the same underlying distribution, as the test set). This is most easily achieved by using a large test set if possible, or by e.g. bootstrapping or cross-validation if there isn't much data available.
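One way to get a feel for that variance is to bootstrap the held-out test set and look at the spread of the resulting scores. This is only a sketch, assuming the `model`, `X_test`, and `y_test` from the earlier illustrative snippet:

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y_pred = model.predict(X_test)

# Resample test cases with replacement and recompute the score, giving a
# rough picture of how much the test error would move if we had drawn a
# different test sample of the same size.
boot_scores = []
for _ in range(2000):
    idx = rng.integers(0, len(y_test), len(y_test))
    boot_scores.append(accuracy_score(y_test[idx], y_pred[idx]))

print("test accuracy:", accuracy_score(y_test, y_pred))
print("bootstrap std. error:", np.std(boot_scores))
```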

I have found that this sort of over-fitting in model selection is a lot more troublesome than is generally appreciated, especially with regard to performance estimation; see:

G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (www)

This sort of problem especially affects benchmark datasets that have been used in many studies: each new study is implicitly influenced by the results of earlier ones, so the observed performance is likely to be an over-optimistic estimate of the method's true performance. The way I try to get around this is to look at many datasets (so the method isn't tuned to one specific dataset) and to use multiple random test/training splits for performance estimation (to reduce the variance of the estimate). However, the results still need the caveat that these benchmarks have been over-fit.
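Multiple random test/training splits are straightforward with scikit-learn's ShuffleSplit (a sketch, with the model and the `X`, `y` data assumed as before):

```python
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

# 20 independent random 75/25 splits: averaging over them reduces the
# variance of the performance estimate compared with a single split.
splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splitter)
print("mean accuracy over 20 splits:", scores.mean(), "+/-", scores.std())
```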

Another example of this occurs in machine learning competitions with a leaderboard based on a validation set. Inevitably some competitors keep tinkering with their model to climb the leaderboard, but then end up towards the bottom of the final rankings. The reason is that their repeated choices have over-fitted the validation set (effectively learning the random variations in the small validation set).
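The effect is easy to reproduce in a small simulation: on pure-noise data, repeatedly keeping whichever "submission" scores best on a small validation set drives the validation score well above chance, while the score of the chosen submission on a larger private test set stays near chance. This is a hypothetical illustration, not an analysis of any real competition:

```python
import numpy as np

rng = np.random.default_rng(0)

n_val, n_private, n_submissions = 200, 10000, 500

# Pure-noise binary labels: no predictor can genuinely beat 50% accuracy.
y_val = rng.integers(0, 2, n_val)
y_private = rng.integers(0, 2, n_private)

best_val = 0.0
best_private = None
for _ in range(n_submissions):
    # Each "submission" is just random guessing.
    val_acc = np.mean(rng.integers(0, 2, n_val) == y_val)
    private_acc = np.mean(rng.integers(0, 2, n_private) == y_private)
    if val_acc > best_val:  # climb the public leaderboard
        best_val, best_private = val_acc, private_acc

print("best public (validation) accuracy:", best_val)   # well above 0.5
print("its private (final) accuracy:", best_private)    # close to 0.5
```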

If you can't keep a statistically pure test set, then I'm afraid the two best options are (i) collect some new data to make a new statistically pure test set or (ii) make the caveat that the new model was based on a choice made after observing the test set error, so the performance estimate is likely to have an optimistic bias.
