Machine Learning – Overfitting During Model Selection: AutoML vs Grid Search

Tags: cross-validation, hyperparameter, machine-learning, model-selection, overfitting

I've recently started paying attention to AutoML algorithms: meta-algorithms that intelligently search the space of machine learning models to find the "pipeline" (preprocessing, feature selection, prediction technique, etc.) that attains the best predictive power on a given classification/regression task. Examples are:

  • Auto-sklearn – Using Bayesian optimization
  • TPOT – Using genetic programming

In my understanding, these techniques evaluate the accuracy of a set of pipelines using cross-validation, and then perform some kind of directed search across the space of pipelines, based on the obtained results, to find pipelines that achieve a better cross-validation error (using the same split of the data). However, using a certain training sample to evaluate a pipeline, improving the pipeline based on the results, and then evaluating it on the same sample (albeit with cross-validation, using the same split) feels like upward-biasing the cross-validation error as an estimator of the true predictive performance, or 'overfitting the selection criterion'. This effect is also described in the paper:
Cawley, G. C. and Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 2010, 11, 2079-2107.
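
To make this concrete, here is a rough sketch of the loop I have in mind, written with scikit-learn; propose_next is a hypothetical stand-in for the Bayesian-optimization or genetic-programming step, and the fixed KFold split is exactly the part I am unsure about:

```python
# Rough sketch of the directed search described above: every candidate pipeline
# is scored by cross-validation on one fixed split, and new candidates are
# proposed based on those scores. propose_next is a hypothetical stand-in for
# Bayesian optimization / genetic programming.
from sklearn.model_selection import KFold, cross_val_score

def directed_search(initial_candidates, propose_next, X, y, n_iter=50):
    cv = KFold(n_splits=5, shuffle=True, random_state=0)  # the *same* split throughout
    history = []                                          # (cv_score, pipeline) pairs
    for pipeline in initial_candidates:
        history.append((cross_val_score(pipeline, X, y, cv=cv).mean(), pipeline))
    for _ in range(n_iter):
        pipeline = propose_next(history)                  # guided by earlier CV scores
        history.append((cross_val_score(pipeline, X, y, cv=cv).mean(), pipeline))
    return max(history, key=lambda entry: entry[0])       # apparently best pipeline
```

The cross-validation score attached to the returned pipeline served both as the search's objective and as its performance estimate, which is exactly what feels like overfitting the selection criterion to me.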

However, in a hypothetical situation where we would have infinite computing power, we could perform a grid search across all different pipeline configurations and parameters, evaluate each using cross-validation, and pick the one with the best performance. (Although we would then need to estimate the actual performance using, for example, nested cross-validation.) Something like this is often done in practice, and it is not commonly seen as 'overfitting the selection criterion'.
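
For reference, a minimal version of that grid-search setup, including a nested cross-validation estimate, could look like the sketch below (scikit-learn; the pipeline and grid are arbitrary toy choices):

```python
# Minimal sketch: the grid search selects on inner CV scores, while nested CV
# estimates the performance of the whole selection procedure. The pipeline and
# grid are arbitrary toy choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01, 0.001]}

search = GridSearchCV(pipe, grid, cv=5)       # inner CV: used for selection
search.fit(X, y)
print("best inner-CV score (selection criterion):", search.best_score_)

# outer CV: estimates the performance of "run the grid search" as a procedure
nested_scores = cross_val_score(search, X, y, cv=5)
print("nested-CV estimate:", nested_scores.mean())
```

The inner best_score_ is the quantity that was maximized over the grid, while the nested estimate is never used for any selection.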

Although both methods could eventually lead to the same performance estimates for the same pipelines, and could therefore select the same pipeline, the first feels like 'overfitting the selection criterion' while the latter does not. I am struggling to see the difference, or the similarity, between the two. Is the grid search overfitting as well, or are the methods fundamentally different? Some views/discussion/insight on this would be appreciated.

Note: I am sure both AutoML algorithms have mechanisms in place to avoid such overfitting; the purpose of this question is not to discuss a specific algorithm, but rather the general concept of overfitting during model selection.

Best Answer

First of all, it is crucial to realize that the overfitting described in the Cawley paper arises from selecting the model with the apparently best performance, while determining that performance is subject to uncertainty. Depending on your field, this uncertainty can be huge; see e.g. our discussion in the context of biomedical spectroscopy:
Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007 (accepted manuscript on arXiv: 1211.1323)

So, at least for the data I typically encounter, model comparisons are typically like measuring sub-mm differences in length with a cm-marked ruler.
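
To put rough numbers on that analogy, here is a small sketch (using statsmodels; the observed accuracy and test-set sizes are arbitrary) of how wide the confidence interval on an observed accuracy is for moderate numbers of test cases:

```python
# Width of a 95% confidence interval for an observed accuracy as a function of
# the number of test cases: with few cases, the interval is far wider than the
# typical difference between competing pipelines.
from statsmodels.stats.proportion import proportion_confint

observed_acc = 0.90
for n in (30, 100, 300):
    k = round(observed_acc * n)                        # number of correct predictions
    lo, hi = proportion_confint(k, n, method="wilson")
    print(f"n={n:3d}: observed {k / n:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a few dozen test cases the interval easily spans 20 percentage points, i.e. far more than the differences one usually tries to detect between competing pipelines.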

However, in a hypothetical situation where we would have infinite computing power, we could perform a grid search across all different pipeline configurations and parameters, evaluate each using cross-validation, and pick the one with the best performance.

Note that, to a first approximation, this would not solve the issue of overfitting. For that, you would need access to a large number of cases: enough to make the uncertainty of the performance estimates negligible compared to the differences between models.

To a second approximation, a full grid search may even aggravate the overfitting problem, since far more model comparisons are made, and from a statistical point of view each of those comparisons comes with the risk of a type I or type II error. From that point of view, it is good to have an optimization strategy that gets by with as few comparisons as possible.
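
A toy simulation makes the effect visible (the numbers are arbitrary): all candidates below have the same true accuracy, yet the apparently best cross-validation score grows with the number of candidates compared:

```python
# Toy "winner's curse" simulation: every candidate model has the same true
# accuracy, but selecting the one with the best noisy performance estimate
# yields an increasingly optimistic score as more candidates are compared.
import numpy as np

rng = np.random.default_rng(0)
true_acc, n_test = 0.80, 100          # true accuracy and size of the test sample

for n_candidates in (1, 10, 100, 1000):
    # observed accuracy of each candidate = binomial noise around true_acc
    observed = rng.binomial(n_test, true_acc, size=(5000, n_candidates)) / n_test
    best = observed.max(axis=1)       # score of the apparently best candidate
    print(f"{n_candidates:4d} candidates -> mean selected score: {best.mean():.3f}")
```

The selected score is optimistically biased even though no candidate is actually better than any other; fewer comparisons mean less of this bias.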

A quick glance at the paper about auto-sklearn and its supplementary material reveals that they pick the apparently best models and are thus subject to the overfitting risk described above. They do pick an ensemble instead of a single model and argue that this alleviates variance issues - but those models are still selected under the same selection scheme, so I would not bet on the overfitting issue being much smaller. (I see this as similar to the difference in overfitting risk between bagging and boosting.)

In any case, you need an independent evaluation of the final model's performance, obtained outside the optimization. Comparing this performance to what the optimizer believes the final model's (or ensemble's) performance to be should give you an indication of whether you ran into overfitting trouble.
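
As a sketch of such a check (scikit-learn again; the estimator and grid are arbitrary stand-ins for whatever your optimizer searches over):

```python
# Independent check: hold out a test set that the optimizer never sees, then
# compare the optimizer's internal estimate with the held-out score. A held-out
# score clearly below the internal estimate hints at overfitting of the
# selection criterion.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)          # the optimization sees only the training part

print("optimizer's internal CV estimate:", search.best_score_)
print("independent held-out estimate:   ", search.score(X_test, y_test))
```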