Solved – Data snooping in selecting between ML models

machine-learning, validation

If you train a lot of machine learning algorithms on a problem (SVM, neural network, random forest, ..., e.g. via caret), and I mean a lot of them, like hundreds or thousands (not that hard, considering all the parameter tuning during validation), eventually you will find one that works.

But that is data snooping. Since you have trained so many models, I think you have to test the hypothesis that you simply got lucky, perhaps with White's Reality Check or Hansen's Superior Predictive Ability (SPA) test.

Around 99.9% of papers, articles, posts, etc. don't use White's or Hansen's tests (the remaining 0.1% are papers on stock trading via ML). I suppose this is because, normally, we only train a few models (really?).

The question is:

Do you have any idea of how many models it takes before the effects of data snooping become serious?

That is, if I'm choosing between 3 models, I think the probability of getting good results by chance is low. But when choosing between 30? 300? 3000?
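You can get a feel for the scale by simulation. The sketch below uses purely illustrative numbers (a 100-example test set and candidate classifiers that are all truly at chance level) to estimate how good the best of k useless binary classifiers looks purely from selection:

```python
import random

def best_of_k(k, n_test=100, trials=1000, seed=0):
    """Monte Carlo estimate of the expected test accuracy of the
    *best* of k models that are all truly no better than coin
    flipping, scored on the same n_test-example test set."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # each chance-level model's accuracy = fraction of 1-bits
        # among n_test fair coin flips
        best = max(bin(rng.getrandbits(n_test)).count("1")
                   for _ in range(k)) / n_test
        total += best
    return total / trials

for k in (3, 30, 300, 3000):
    print(f"best of {k:>4} chance models: ~{best_of_k(k):.3f} accuracy")
```

With these numbers, the winner among 3 chance models looks only slightly better than 50%, but the winner among thousands can look like a 65–70% classifier, which is exactly the lucky-survivor effect the question is about.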

Best Answer

A lot of effort is underway to address this issue of data dredging with techniques from differential privacy. These methods allow repeated reuse of the test set when evaluating different models, while keeping enough information about the test set hidden that the final model does not overfit to it.

See the tutorial "Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis" from ICML 2016:

http://icml.cc/2016/?page_id=97

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015, June). Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (pp. 117-126). ACM.
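The core mechanism in that paper, Thresholdout, can be sketched roughly as follows. This is a simplified illustration, not the exact algorithm: the noise scales, the default threshold/sigma values, and the omitted budget accounting are all my own simplifications.

```python
import random

class Thresholdout:
    """Loose sketch of the Thresholdout mechanism from Dwork et al.
    (2015): adaptive queries are answered from the training set
    unless they disagree noticeably with the holdout set, and the
    holdout value is only ever released with added noise.  The real
    algorithm's noise distributions and overfitting budget are
    simplified away here."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01, seed=0):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma = threshold, sigma
        self.rng = random.Random(seed)

    def query(self, stat):
        # stat maps a dataset (list of examples) to a number,
        # e.g. the accuracy of one candidate model on it
        t = stat(self.train)
        h = stat(self.holdout)
        if abs(t - h) > self.threshold + self.rng.gauss(0, self.sigma):
            # train and holdout disagree: release a noisy holdout value
            return h + self.rng.gauss(0, self.sigma)
        # they agree: answering from the train set leaks nothing
        # new about the holdout
        return t
```

Selecting among hundreds or thousands of models by querying such a mechanism, instead of reading raw test-set accuracies, is what makes adaptive reuse of the same test set statistically safe.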
