Solved – Should I further tune the model based on results on the test set or not?

cross-validation, machine-learning

I understand that we need to split our data into training, validation, and test sets: we use the training set to train the model, use cross-validation on the validation set to tune it, and finally use the set-aside test set, which the model has never seen, to get an honest estimate of its generalization performance on unseen data.
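For concreteness, here is a minimal sketch of that workflow, assuming Python with scikit-learn; the synthetic data, the SVC model, and the parameter grid are placeholders chosen only for illustration, not something from the question itself.

    # Sketch: split off a test set, tune by cross-validation, evaluate once.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    # Set the test set aside first; it is never touched during tuning.
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Tune hyperparameters by cross-validation on the remaining (dev) data.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
    search.fit(X_dev, y_dev)

    # A single evaluation on the untouched test set gives the honest estimate.
    print("best CV score :", search.best_score_)
    print("test-set score:", search.score(X_test, y_test))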

However, I am not sure whether we should further tune the model after getting a result on the test set; in particular, whether to optimize the model based on that test-set result.

My understanding is that if a certain parameter setting seems to lift the test-set performance, and we therefore adopt that setting over the previous one, this is "data leakage": we are giving the model knowledge of the test data, which results in overfitting.

On the other hand, if we don't do anything, then it makes no sense to use the test set more than once; we would use it just one single time at the very end of model building and evaluation. But what if the performance on the test set is really bad; do we not go back to the model and try other parameter combinations not already used in cross-validation? If we do, it seems we come back to the question in the previous paragraph and we are overfitting… I'm really struggling with this process.

Also, these two posts seem to suggest different solutions. The first post indicates we could further tune based on the test result, while the second post clearly says we should do nothing! But again, if we do nothing, that implies we can only use the test set ONE TIME…

Can someone please help me clarify these concerns? Thanks in advance!

Best Answer

what if the performance on the test set is really bad; do we not go back to the model and try other parameter combinations not already used in cross-validation?

Of course you'll do that. But the point is that any given data set can be used in only one way: either for measuring generalization performance or for fitting the model (and tuning is really nothing else but part of the training process).
So once you start using the next data set for training (regardless of whether you call it tuning, fine-tuning, or anything else: if it influences the model, I'll call it training, and training thus includes selecting appropriate hyperparameters), you'll need a still-unknown data set to measure the generalization performance of the new model, or you need to state clearly that you don't have any such data and that the reported performance is subject to an optimistic bias.
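To illustrate the point, here is a small sketch (with placeholder data and models, not taken from the answer itself): as soon as the "test" set is used to choose between candidate models, that choice is part of training, and the same set no longer yields an unbiased estimate for the selected model.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=600, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)

    # Several candidate models, fitted on the training data only.
    candidates = [LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
                  for c in (0.01, 0.1, 1.0, 10.0)]

    # Choosing the winner by its test-set score means the test set has now
    # influenced the model, i.e. it has effectively been used for training ...
    best = max(candidates, key=lambda m: m.score(X_test, y_test))

    # ... so this score is optimistically biased; an honest estimate of the
    # selected model requires data not used anywhere so far.
    print("'test' score of the model selected on the test set:",
          best.score(X_test, y_test))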


Whether the latter is a viable option IMHO depends a lot on the actual situation, i.e. on how much bias is expected: did you pick the best of 10,000 models on the basis of 25 cases, or did you "only" check whether PCA regularization helps, find no difference, and report that? As a scientific reviewer, I often put more emphasis on whether the authors are aware of possible overfitting issues and limit their conclusions accordingly than on requiring an absolutely independent test at a stage where only a few example cases are available, which anyway do not cover half of the relevant confounders.

You could, e.g., argue that you are in a hypothesis-generating rather than a hypothesis-testing stage. In medical research, you'd maybe do a lot of tuning on incrementally growing data sets, and once you are convinced of your model, you'll anyway have to go and get funding for a double-blinded validation study. In such a situation, it would be a complete waste of resources to demand a fully grown validation at every step.
BUT: this holds only as long as you are aware of the risk of overfitting, avoid it as much as possible, limit your conclusions, AND report everything you have done to your data (search term: data dredging).

OTOH, if this is the classifier you're starting to sell as a pedestrian-recognizing brake assistant for automated driving, not having a proper validation study is not an option...


update:

Your choice is basically between

  • a model with a properly estimated generalization performance, which turned out to be too bad to be useful, or
  • a possibly better model whose generalization performance you do not know, because that model was generated/tuned also using the "test" set, which can therefore no longer be used for a proper estimate of generalization performance.

In order to properly estimate the generalization performance of the tuned model, you need to obtain a new test set that is independent of all data used so far.
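A minimal sketch of that last step (again with synthetic placeholder data standing in for the old splits and for the newly collected, independent data; the hyperparameters are assumed to be the outcome of the earlier tuning):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=2)

    # "used" stands for everything touched so far (old train + validation +
    # the old "test" set, which by now has influenced the tuning);
    # "fresh" stands for data that played no role in any earlier step.
    X_used, X_fresh, y_used, y_fresh = train_test_split(
        X, y, test_size=0.2, random_state=2)

    # Hypothetical hyperparameters, assumed to be the result of the tuning.
    final_model = SVC(C=10, gamma=0.01).fit(X_used, y_used)

    # Only this score is a proper estimate of the tuned model's generalization.
    print("generalization estimate on fresh data:",
          final_model.score(X_fresh, y_fresh))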
