Solved – How to correctly retrain model using all data, after cross-validation with early stopping

boosting, cross-validation, neural networks

I have a classification task without a great deal of data, so I'd like to make the most of what I have. I'm using a boosting model and have run 5-fold CV, using the validation fold for early stopping. This works reasonably well, but it leaves me with 5 different estimates of when to stop training, say 100, 120, 80, 70 and 150 rounds. I'd now like to retrain the model on all of the available labelled data, which means choosing a single value for the number of boosting rounds, and it's not clear to me which value to use.

I see three options:

  1. estimate the ideal number of rounds as the mean from CV; in the example above that means training for 104 rounds.

  2. use the maximum number of rounds established in CV, i.e. 150 rounds in this example.

  3. generate predictions at production time by ensembling the 5 models from CV.
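
For concreteness, here's roughly what my setup and the three options look like. I'm using xgboost's native API here purely for illustration (any boosting library that reports a best iteration from early stopping would do), and `X`, `y` and `X_new` are placeholder numpy arrays, not real names from my project:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

def cv_best_iterations(X, y, n_splits=5, seed=0):
    """k-fold CV with early stopping; collect the best round and model per fold."""
    best_rounds, fold_models = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dval = xgb.DMatrix(X[val_idx], label=y[val_idx])
        bst = xgb.train(
            params, dtrain,
            num_boost_round=1000,       # generous upper bound
            evals=[(dval, "val")],      # validation fold drives early stopping
            early_stopping_rounds=50,
            verbose_eval=False,
        )
        best_rounds.append(bst.best_iteration)
        fold_models.append(bst)
    return best_rounds, fold_models

best_rounds, fold_models = cv_best_iterations(X, y)
dall = xgb.DMatrix(X, label=y)

# Option 1: retrain on all data for the mean number of rounds found in CV.
model_mean = xgb.train(params, dall, num_boost_round=int(np.mean(best_rounds)))

# Option 2: retrain on all data for the maximum number of rounds found in CV.
model_max = xgb.train(params, dall, num_boost_round=max(best_rounds))

# Option 3: skip retraining and average the predictions of the 5 fold models.
dnew = xgb.DMatrix(X_new)
ensemble_pred = np.mean([m.predict(dnew) for m in fold_models], axis=0)
```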

I can't find this discussed in the literature – even my friend at Google doesn't know, as they have so much data they never need to worry about this. I'd really appreciate it if someone could tell me which option is best.

Best Answer

Something similar has already been discussed in this question:

Is epoch optimization in CV with constant mini-batch size even possible?

To summarize the result: you should probably keep a few samples aside and use them as a validation set. The benefit of knowing whether your model is still improving and not yet overfitting will probably outweigh the benefit of having a few more samples for training.
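
A minimal sketch of that suggestion, reusing the placeholder names from the question (the hold-out size and the use of xgboost are illustrative assumptions, not recommendations):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hold out a small validation set, retrain on the rest, and let early
# stopping on the hold-out decide the final number of rounds.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
dfit = xgb.DMatrix(X_fit, label=y_fit)
dhold = xgb.DMatrix(X_hold, label=y_hold)
final_model = xgb.train(
    params, dfit,
    num_boost_round=1000,
    evals=[(dhold, "holdout")],
    early_stopping_rounds=50,
    verbose_eval=False,
)
```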

Also, don't forget that if you change the size of the training set, the "epoch count" found during CV stops making sense (see the thread above): with a fixed mini-batch size, a larger training set means more updates per epoch, so the same epoch count corresponds to a different amount of training.

Alternatively, see also OAA mentioned in this answer.