Cross-Validation – How Bad is Hyperparameter Tuning Outside Cross-Validation?

cross-validation, hyperparameter, validation

I know that performing hyperparameter tuning outside of cross-validation can lead to biased-high estimates of external validity, because the dataset you use to measure performance is the same one you used to tune the hyperparameters.

What I'm wondering is how bad of a problem this is. I can understand how it would be really bad for feature selection, since this gives you a huge number of parameters to tune. But what if you're using something like LASSO (which has only one hyperparameter, the regularization strength), or a random forest without feature selection (which can have a few hyperparameters, but nothing as dramatic as adding/dropping noise features)?

In these scenarios, how badly optimistic could you expect your cross-validated performance estimate to be?

I'd appreciate any info on this: case studies, papers, anecdata, etc. Thanks!

EDIT: To clarify, I'm not talking about estimating model performance on the training data (i.e., not using cross-validation at all). By "hyperparameter tuning outside of cross-validation" I mean using cross-validation only to estimate the performance of each individual model, but not including an outer, second cross-validation loop to correct for overfitting within the hyperparameter tuning procedure (as distinct from overfitting during the training procedure). See e.g. the answer here.
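To make that concrete, here is a minimal sketch of the protocol I'm asking about. The use of scikit-learn, LASSO and a synthetic dataset is purely illustrative; none of these choices are essential to the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# A single cross-validation loop is used both to choose alpha and to report performance.
grid = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": np.logspace(-3, 1, 20)},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# Reporting best_score_ as the performance estimate is what I mean by
# "hyperparameter tuning outside of cross-validation": the same CV scores
# that selected alpha are also the scores being reported, so the estimate
# is presumably optimistic to some degree.
print("chosen alpha:", grid.best_params_["alpha"])
print("reported CV score:", grid.best_score_)
```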

Best Answer

The effects of this bias can be very large. A good demonstration is given by the open machine learning competitions featured at some machine learning conferences. These typically have a training set, a validation set and a test set. The competitors don't get to see the labels for either the validation set or the test set (obviously). The validation set is used to determine the ranking of competitors on a leaderboard that everyone can see while the competition is in progress. It is very common for those at the head of the leaderboard at the end of the competition to end up very low in the final ranking based on the test data. This is because they have tuned the hyper-parameters of their learning systems to maximise their leaderboard performance, and in doing so have over-fitted the validation data. More experienced competitors pay little or no attention to the leaderboard and instead adopt more rigorous, unbiased performance estimates to guide their methodology.

The example in my paper (mentioned by Jacques) shows that the effects of this kind of bias can be of the same sort of size as the difference between learning algorithms, so the short answer is: don't use biased performance evaluation protocols if you are genuinely interested in finding out what works and what doesn't. The basic rule is "treat model selection (e.g. hyper-parameter tuning) as an integral part of the model fitting procedure, and include it in each fold of the cross-validation used for performance evaluation".
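As a rough sketch of that rule (scikit-learn and a synthetic regression problem here are illustrative assumptions, not the setup from the paper), the whole tuning procedure is treated as the "estimator" and wrapped in an outer cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Illustrative synthetic data.
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# Inner loop: hyper-parameter tuning, treated as part of fitting the model.
tuned_lasso = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": np.logspace(-3, 1, 20)},
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_squared_error",
)

# Outer loop: performance evaluation. The tuning is re-run on each outer
# training fold, so the outer test folds never influence the choice of alpha.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
nested_scores = cross_val_score(
    tuned_lasso, X, y, cv=outer_cv, scoring="neg_mean_squared_error"
)
print("nested CV estimate:", nested_scores.mean())
```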

The fact that regularisation is less prone to over-fitting than feature selection is precisely the reason that LASSO etc. are good ways of performing feature selection. However, the size of the bias depends on the number of features, the size of the dataset and the nature of the learning task (i.e. there is an element that depends on the particular dataset and will vary from application to application). The data-dependent nature of this means that you are better off estimating the size of the bias yourself, by using an unbiased protocol and comparing the difference with the biased estimate (reporting that the method is robust to over-fitting in model selection in your particular case may be of interest in itself).
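For instance, one way to gauge the bias on a given dataset is to compute both estimates on the same data and look at the gap. This is again an illustrative sketch combining the two snippets above, not the experimental protocol from the paper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

def tuner(seed):
    # CV-based hyper-parameter tuning over a small grid of alphas.
    return GridSearchCV(
        Lasso(max_iter=10000),
        param_grid={"alpha": np.logspace(-3, 1, 20)},
        cv=KFold(n_splits=5, shuffle=True, random_state=seed),
        scoring="neg_mean_squared_error",
    )

# Biased protocol: the same CV loop selects alpha and reports the score.
non_nested = tuner(1).fit(X, y).best_score_

# Unbiased protocol: the tuning is repeated inside every outer fold.
nested = cross_val_score(
    tuner(1), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=2),
    scoring="neg_mean_squared_error",
).mean()

# The gap is a (noisy) data-dependent estimate of the optimism; repeating
# over several random splits gives a more stable figure.
print("non-nested:", non_nested, "nested:", nested, "gap:", non_nested - nested)
```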

G. C. Cawley and N. L. C. Talbot (2010), "On over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, 11, pp. 2079-2107 (see section 5.2).