Machine Learning – Choosing the Best Training-Validation Split

cross-validation, machine learning, neural networks, validation

In a machine learning context, suppose I have 100 observations that will be split into a training and a validation set (say #1 ~ #100) and a totally separate 100 observations for the test set (say #101 ~ #200).
Suppose there is no order in the observations #1 ~ #100.

I tried 5 different splits.
Model 1. #1 ~ #20 as validation set and #21 ~ #100 as training set.
Model 2. #21 ~ #40 as validation set and #1 ~ #20, #41 ~ #100 as training set.
…
Model 5. #81 ~ #100 as validation set and #1 ~ #80 as training set.

I fitted 5 machine learning models using the above 5 different splits and measured performance (such as RMSE) on the test set (#101 ~ #200).
If I choose the model with the lowest RMSE on the test set among models 1 ~ 5, call its split the 'best split' of observations #1 ~ #100 into training and validation sets, and use that model as the final model, is this a correct argument?
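
For concreteness, here is roughly what I did, sketched with placeholder data and a placeholder learner (the actual data and model are not important for the question):

```python
# Sketch of the procedure above; data and model are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)  # stand-in data
X_trval, y_trval = X[:100], y[:100]                     # observations #1 ~ #100
X_test, y_test = X[100:], y[100:]                       # observations #101 ~ #200

test_rmse = []
for k in range(5):                                   # the 5 consecutive splits
    val_idx = np.arange(20 * k, 20 * (k + 1))        # e.g. #1 ~ #20 for Model 1
    train_idx = np.setdiff1d(np.arange(100), val_idx)
    model = RandomForestRegressor(random_state=0)
    model.fit(X_trval[train_idx], y_trval[train_idx])
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    test_rmse.append(rmse)

best_k = int(np.argmin(test_rmse))  # the split I would then call the "best split"
```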

I feel something is wrong with this argument, but I cannot logically rebut it.

Best Answer

Yes, your approach is wrong. Here we go with the explanation:

  • As @JanKukaczka already explained, what you did within your 100 training samples is called 5-fold cross validation with consecutive splits.
  • Where things go wrong is when you try to select a "best" split. There are two arguments against doing this at all.

    1. The underlying assumption of cross validation (or, more generally, resampling validation) is that the (here: 5) so-called surrogate models you train are equivalent, i.e. their predictions are practically the same for the same (test) sample.
      This equivalence also means that there should not be a sensible single best surrogate model. If you have true and practically relevant differences between your surrogate models, then your model training procedure is unstable and you cannot rely on it in combination with the given data.
      In addition, observed performance differences between your surrogate models can be due either to model instability, as explained above, or to evaluating performance with different test cases (imagine one surrogate model being tested with clear "textbook" cases of the classes vs. another being tested with borderline cases). Depending on the number of test cases (20 in your scenario) and the performance figure of merit you use, this can be a substantial source of uncertainty. E.g. comparing accuracy based on 20 test cases requires an accuracy difference of > 30 % (100 % vs. 70 %) before the difference can possibly be significant for a single comparison (see the sketch after this list).

    2. A second argument, without relying on the cross validation concept: tell us, why should there be a best split, and what would it mean? You are looking at a set of training data for which you give us no reason to assume further clustering/subdivisions. From that, again, any observed difference in performance should be spurious, as in: due to the accidental choice of which observation is called sample n.
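
To put a number on the 20-test-case point above, here is a quick back-of-the-envelope check; the Fisher exact test below is just one possible stand-in for an unpaired comparison of two observed accuracies (the argument does not hinge on this particular test):

```python
# Rough check of how large an accuracy difference 20 test cases can resolve.
# Fisher's exact test is used only as an illustrative unpaired comparison.
from scipy.stats import fisher_exact

n = 20  # validation cases per fold

# 100 % vs. 70 % accuracy (20/20 correct vs. 14/20 correct):
_, p = fisher_exact([[20, 0], [14, 6]])
print(f"100% vs. 70% on {n} cases: p = {p:.3f}")  # ~0.02, only just detectable

# 100 % vs. 85 % accuracy (20/20 vs. 17/20): far from significant
_, p = fisher_exact([[20, 0], [17, 3]])
print(f"100% vs. 85% on {n} cases: p = {p:.3f}")  # ~0.23
```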


What happens

If I choose the model with the lowest RMSE on the test set among models 1 ~ 5, call its split the 'best split' of observations #1 ~ #100 into training and validation sets, and use that model as the final model

is that you have found a training subset of your 100 training cases that accidentally gives high performance on your test set. We know it is accidental, because exchanging 1/4 of its training cases for other training cases that were inherently assumed to be equivalent to the replaced ones* changed the observed performance. In other words, you are overfitting to your test set.
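
A tiny simulation (with made-up numbers) illustrates this selection effect: if five surrogate models have identical true performance and their test-set RMSEs differ only by noise, the lowest observed RMSE is systematically optimistic.

```python
# Selection effect with equivalent models: picking the minimum of five
# noisy RMSE estimates underestimates the true RMSE.  All numbers are
# made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_rmse, noise_sd, n_repeats = 1.0, 0.05, 10_000

observed = true_rmse + noise_sd * rng.standard_normal((n_repeats, 5))
print(f"mean observed RMSE:             {observed.mean():.3f}")              # ~1.00
print(f"mean of the per-repeat minimum: {observed.min(axis=1).mean():.3f}")  # ~0.94
```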

To put it a bit more bluntly, you are getting close to data-dredging if you try to derive "serious" conclusions directly from the selected split (though it may be possible to derive a useful permutation test this way).

* If you think the training cases are not equivalent, this means you suspect or know of further subgroups/clusters/confounding factors, and you need to account for this in your model.
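
For completeness, one conventional way to proceed instead (a sketch under the assumption that the cross-validation results look stable; nothing above prescribes this verbatim): use the splits only to estimate performance, then refit a single model on all 100 training cases and evaluate it exactly once on the untouched test set.

```python
# Sketch of the conventional alternative: cross-validate within the 100
# training cases for a performance estimate / stability check, then refit
# on all 100 cases and evaluate once on the held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)  # stand-in data
X_trval, y_trval, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

model = RandomForestRegressor(random_state=0)
cv_rmse = -cross_val_score(model, X_trval, y_trval,
                           cv=KFold(n_splits=5),  # the same 5 consecutive splits
                           scoring="neg_root_mean_squared_error")
print("CV RMSE per fold:", cv_rmse)  # similar values across folds = stable training

final_model = model.fit(X_trval, y_trval)          # all 100 training cases
test_rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
print("Test RMSE (single final evaluation):", test_rmse)
```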
