Machine Learning – Can Overfitting Occur by Training Machine Learning Algorithms Using CV/Bootstrap?

bootstrap, cross-validation, machine learning, optimization, resampling

This question may well be too open-ended to get a definitive answer, but hopefully not.

Machine learning algorithms, such as SVM, GBM, Random Forest etc., generally have some free parameters that, beyond some rule-of-thumb guidance, need to be tuned to each data set. This is generally done with some kind of re-sampling technique (bootstrap, CV, etc.) in order to find the set of parameter values that gives the best generalisation error.
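For concreteness, here is a minimal sketch of that workflow using scikit-learn (the data set, parameter names and grid values are purely illustrative, not a recommendation): cross-validation is run for every candidate setting and the one with the best mean CV score is kept.

```python
# Tune GBM hyper-parameters by 5-fold cross-validation over a small grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 600],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.1],
}

# Each grid point gets a 5-fold CV estimate of generalisation performance;
# the combination with the best mean CV accuracy is selected.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```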

My question is, can you go too far here? People talk about doing grid searches and so forth, but why not simply treat this as an optimisation problem and drill down to the best possible set of parameters? I asked about some of the mechanics of this in this question, but it hasn't received much attention. Maybe the question was badly asked, but perhaps the question itself represents a bad approach that people generally avoid?
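To make the "optimisation problem" idea concrete, here is a rough sketch of what I have in mind (scikit-learn/SciPy, an SVM rather than a GBM purely for speed; all parameter choices are illustrative): the cross-validation error is treated as an objective function and handed to a general-purpose optimiser, which will happily polish it to many decimal places.

```python
# Treat the cross-validation error as an objective and let a general-purpose
# optimiser "drill down" on the hyper-parameters (here an RBF-SVM's C and gamma).
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def cv_error(log_params):
    C, gamma = np.exp(log_params)              # search on a log scale
    scores = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5)
    return 1.0 - scores.mean()                 # CV misclassification rate

# Nelder-Mead keeps refining the hyper-parameters as long as the (noisy)
# CV estimate keeps improving.
result = minimize(cv_error, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
print(np.exp(result.x), result.fun)
```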

What bothers me is the lack of regularisation. I might find by re-sampling that the best number of trees to grow in a GBM for this data set is 647 with an interaction depth of 4, but how sure can I be that this will hold for new data (assuming the new population is identical to the training set)? With no reasonable value to 'shrink' towards (or, if you will, no informative prior information), re-sampling seems like the best we can do. I just don't hear any talk about this, so it makes me wonder if there is something I'm missing.

Obviously there is a large computational cost associated with doing many, many iterations to squeeze every last bit of predictive power out of a model, so clearly this is something you would only do if you've got the time/grunt for the optimisation and every bit of performance improvement is valuable.

Best Answer

There is a definitive answer to this question, which is "yes, it is certainly possible to over-fit a cross-validation based model selection criterion and end up with a model that generalises poorly!". In my view this is not widely appreciated, but it is a substantial pitfall in the application of machine learning methods, and is the main focus of my current research; I have written two papers on the subject so far:

G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (www)

which demonstrates that over-fitting in model selection is a substantial problem in machine learning (and you can get severely biased performance estimates if you cut corners in model selection during performance evaluation) and

G. C. Cawley and N. L. C. Talbot, Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters, Journal of Machine Learning Research, vol. 8, pp. 841-861, April 2007. (www)

where the cross-validation based model selection criterion is regularised to try to ameliorate over-fitting in model selection (which is a key problem if you use a kernel with many hyper-parameters).

I am writing up a paper on grid-search based model selection at the moment, which shows that it is certainly possible to use a grid that is too fine where you end up with a model that is statistically inferior to a model selected by a much coarser grid (it was a question on StackExchange that inspired me to look into grid-search).
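To see the effect in miniature, here is a toy simulation (not taken from the papers above; the data set and candidate ranges are purely illustrative): many candidate hyper-parameter settings are evaluated by cross-validation on a small sample, the setting with the best CV score is selected, and that score is then compared with the winner's accuracy on a large independent test set. Because the CV estimate is noisy and the search picks its maximum, the winning CV score will usually be optimistic.

```python
# Toy demonstration of over-fitting the model selection criterion: select the
# best of many candidate SVMs by CV score on a small sample, then check the
# winner on a large independent test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_small, y_small = X[:100], y[:100]   # small sample used for model selection
X_test, y_test = X[100:], y[100:]     # large independent test set

best_cv, best_model = -np.inf, None
for _ in range(200):                  # many candidate hyper-parameter settings
    C = 10 ** rng.uniform(-2, 3)
    gamma = 10 ** rng.uniform(-3, 1)
    model = SVC(C=C, gamma=gamma)
    score = cross_val_score(model, X_small, y_small, cv=5).mean()
    if score > best_cv:
        best_cv, best_model = score, model

best_model.fit(X_small, y_small)
print("best CV accuracy:  %.3f" % best_cv)
print("test accuracy:     %.3f" % best_model.score(X_test, y_test))
```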

Hope this helps.

P.S. Unbiased performance evaluation and reliable model selection can indeed be computationally expensive, but in my experience it is well worthwhile. Nested cross-validation, where the outer cross-validation is used for performance estimation and the inner cross-validation for model selection, is a good basic approach.
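As a rough sketch of what that looks like in code (scikit-learn, with an SVM and an illustrative parameter grid): the inner cross-validation, wrapped inside the grid search, performs model selection, while the outer cross-validation estimates the performance of the whole selection-plus-fitting procedure.

```python
# Nested cross-validation: inner CV selects the hyper-parameters, outer CV
# estimates the performance of the complete procedure.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
inner = GridSearchCV(SVC(), param_grid, cv=5)      # inner CV: model selection

outer_scores = cross_val_score(inner, X, y, cv=5)  # outer CV: performance estimate
print("nested CV accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```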
