Machine Learning – Overfitting During Model Selection: AutoML vs Grid Search

Tags: cross-validation, hyperparameter, machine-learning, model-selection, overfitting

I've recently started paying attention to AutoML algorithms: meta-algorithms that intelligently search the space of machine learning models to find the "pipeline" (preprocessing, feature selection, prediction technique, etc.) that attains the best predictive power on a given classification/regression task. Examples are:

  • Auto-sklearn – Using Bayesian optimization
  • TPOT – Using genetic programming

In my understanding, these techniques evaluate the accuracy of a set of pipelines using cross-validation, and then perform some kind of directed search across the space of pipelines, based on the obtained results, to find pipelines that achieve a better cross-validation error (using the same split of the data). However, using a certain training sample to evaluate a pipeline, improving the pipeline based on the results, and then evaluating it on the same sample (albeit with cross-validation, using the same split) feels like upward-biasing the cross-validation error as an estimator of the true predictive performance, or 'overfitting the selection criterion'. This effect is also described in the paper:
Cawley, G. C. and Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 2010, 11, 2079-2107.
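
To make this concrete, here is a rough sketch of the loop I have in mind, written with scikit-learn; propose_next is a hypothetical stand-in for the Bayesian-optimization or genetic-programming step, and the fixed KFold split is exactly the part I am unsure about:

```python
# Rough sketch of the directed search described above: every candidate pipeline
# is scored by cross-validation on one fixed split, and new candidates are
# proposed based on those scores. propose_next is a hypothetical stand-in for
# Bayesian optimization / genetic programming.
from sklearn.model_selection import KFold, cross_val_score

def directed_search(initial_candidates, propose_next, X, y, n_iter=50):
    cv = KFold(n_splits=5, shuffle=True, random_state=0)  # the *same* split throughout
    history = []                                          # (cv_score, pipeline) pairs
    for pipeline in initial_candidates:
        history.append((cross_val_score(pipeline, X, y, cv=cv).mean(), pipeline))
    for _ in range(n_iter):
        pipeline = propose_next(history)                  # guided by earlier CV scores
        history.append((cross_val_score(pipeline, X, y, cv=cv).mean(), pipeline))
    return max(history, key=lambda entry: entry[0])       # apparently best pipeline
```

The cross-validation score attached to the returned pipeline served both as the search's objective and as its performance estimate, which is exactly what feels like overfitting the selection criterion to me.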

However, in a hypothetical situation where we would have infinite computing power, we could perform a grid search across all different pipeline configurations and parameters, evaluate each using cross-validation, and pick the one with the best performance. (Although we would then need to estimate the actual performance using, for example, nested cross-validation.) Something like this is often done in practice, and it is not commonly seen as 'overfitting the selection criterion'.
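
For reference, a minimal version of that grid-search setup, including a nested cross-validation estimate, could look like the sketch below (scikit-learn; the pipeline and grid are arbitrary toy choices):

```python
# Minimal sketch: the grid search selects on inner CV scores, while nested CV
# estimates the performance of the whole selection procedure. The pipeline and
# grid are arbitrary toy choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = {"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01, 0.001]}

search = GridSearchCV(pipe, grid, cv=5)       # inner CV: used for selection
search.fit(X, y)
print("best inner-CV score (selection criterion):", search.best_score_)

# outer CV: estimates the performance of "run the grid search" as a procedure
nested_scores = cross_val_score(search, X, y, cv=5)
print("nested-CV estimate:", nested_scores.mean())
```

The inner best_score_ is the quantity that was maximized over the grid, while the nested estimate is never used for any selection.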

Although both methods could eventually lead to the same performance estimates for the same pipelines, and could therefore select the same pipeline, the first feels like 'overfitting the selection criterion' while the latter does not. I am struggling to see the difference, or the similarity, between the two. Is the grid search overfitting as well, or are the methods fundamentally different? Some views/discussion/insight on this would be appreciated.

Note: I am sure both AutoML algorithms have mechanisms in place to avoid such overfitting; the purpose of this question is not to discuss a specific algorithm, but rather the general concept of overfitting during model selection.

Best Answer

First of all, it is crucial to realize that the overfitting described in the Cawley paper arises from selecting the model with the apparently best performance, while determining that performance is subject to uncertainty. Depending on your field, this uncertainty can be huge; see e.g. our discussion in the context of biomedical spectroscopy:
Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007 (accepted manuscript on arXiv: 1211.1323)

So, at least for the data I typically encounter, model comparisons are typically like measuring sub-mm differences in length with a cm-marked ruler.
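
To put rough numbers on that analogy, here is a small sketch (using statsmodels; the observed accuracy and test-set sizes are arbitrary) of how wide the confidence interval on an observed accuracy is for moderate numbers of test cases:

```python
# Width of a 95% confidence interval for an observed accuracy as a function of
# the number of test cases: with few cases, the interval is far wider than the
# typical difference between competing pipelines.
from statsmodels.stats.proportion import proportion_confint

observed_acc = 0.90
for n in (30, 100, 300):
    k = round(observed_acc * n)                        # number of correct predictions
    lo, hi = proportion_confint(k, n, method="wilson")
    print(f"n={n:3d}: observed {k / n:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a few dozen test cases the interval easily spans 20 percentage points, i.e. far more than the differences one usually tries to detect between competing pipelines.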

However, in a hypothetical situation where we would have infinite computing power, we could perform a grid search across all different pipeline configurations and parameters, evaluate each using cross-validation, and pick the one with the best performance.

Note that, to a first approximation, this would not solve the issue of overfitting. For that, you would need access to a large number of cases: enough to make the uncertainty of the performance estimates negligible compared to the differences between models.

To a second approximation, a full grid search may even aggravate the overfitting problem, since far more model comparisons are made, and from a statistical point of view each of those comparisons comes with the risk of a type I or type II error. From that point of view, it is good to have an optimization strategy that gets by with as few comparisons as possible.
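
A toy simulation makes the effect visible (the numbers are arbitrary): all candidates below have the same true accuracy, yet the apparently best cross-validation score grows with the number of candidates compared:

```python
# Toy "winner's curse" simulation: every candidate model has the same true
# accuracy, but selecting the one with the best noisy performance estimate
# yields an increasingly optimistic score as more candidates are compared.
import numpy as np

rng = np.random.default_rng(0)
true_acc, n_test = 0.80, 100          # true accuracy and size of the test sample

for n_candidates in (1, 10, 100, 1000):
    # observed accuracy of each candidate = binomial noise around true_acc
    observed = rng.binomial(n_test, true_acc, size=(5000, n_candidates)) / n_test
    best = observed.max(axis=1)       # score of the apparently best candidate
    print(f"{n_candidates:4d} candidates -> mean selected score: {best.mean():.3f}")
```

The selected score is optimistically biased even though no candidate is actually better than any other; fewer comparisons mean less of this bias.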

A quick glance at the paper about auto-sklearn and its supplementary material reveals that they pick the apparently best models and are thus subject to the overfitting risk described above. They do pick an ensemble instead of a single model and argue that this alleviates variance issues - but those models are still selected under the same selection scheme, so I would not bet on the overfitting issue being much smaller. (I see this as similar to the difference in overfitting risk between bagging and boosting.)

In any case, you need an independent evaluation of the final model's performance, obtained outside the optimization. Comparing this performance to what the optimizer believes the final model's (or ensemble's) performance to be should give you an indication of whether you ran into overfitting trouble.
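
As a sketch of such a check (scikit-learn again; the estimator and grid are arbitrary stand-ins for whatever your optimizer searches over):

```python
# Independent check: hold out a test set that the optimizer never sees, then
# compare the optimizer's internal estimate with the held-out score. A held-out
# score clearly below the internal estimate hints at overfitting of the
# selection criterion.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100]}, cv=5)
search.fit(X_train, y_train)          # the optimization sees only the training part

print("optimizer's internal CV estimate:", search.best_score_)
print("independent held-out estimate:   ", search.score(X_test, y_test))
```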