Machine Learning – Should Final Production Model Be Trained on Complete Data or Just Training Set?

machine-learning, regression-strategies, validation

Suppose I trained several models on a training set, chose the best one using a cross-validation set, and measured its performance on a test set. So now I have one final best model. Should I retrain it on all my available data, or ship the solution trained only on the training set? If the latter, why?

UPDATE:
As @P.Windridge noted, shipping a retrained model basically means shipping a model without validation. But we can report the test-set performance and only then retrain the model on the complete data, justifiably expecting its performance to be better – because we use our best model plus more data. What problems may arise from such a methodology?
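The workflow in question can be sketched as follows. This is a minimal illustration with scikit-learn on synthetic data; the candidate models, split sizes, and scoring are all placeholder choices, not part of the original question.

```python
# Sketch of the described workflow: select a model via CV on the training
# set, report held-out test performance, then refit the winner on ALL data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (stand-in for the real dataset).
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models, compared by mean cross-validated R^2 on the training set.
candidates = {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.1)}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
best = candidates[best_name]

# Honest performance estimate from the untouched test set...
best.fit(X_train, y_train)
test_r2 = best.score(X_test, y_test)

# ...then refit on the complete data for production. Note: this final fit
# itself has no held-out validation, which is exactly the concern raised.
best.fit(X, y)
```

The reported `test_r2` describes the model trained on the training set only; the shipped model (refit on everything) is a different fit whose performance is merely extrapolated from it.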

Best Answer

You will almost always get a better model after refitting on the whole sample. But, as others have said, you then have no validation. This is a fundamental flaw in the data-splitting approach: not only is data splitting a lost opportunity to directly model sample differences in an overall model, it is also unstable unless your whole sample is perhaps larger than 15,000 subjects. This is why 100 repeats of 10-fold cross-validation are necessary (depending on the sample size) to achieve precision and stability, and why the bootstrap for strong internal validation is even better. The bootstrap also exposes how difficult and arbitrary the task of feature selection is.
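The two alternatives named above can be sketched briefly. This is a hedged illustration with scikit-learn and synthetic data: the repeat count is reduced from the 100 mentioned for speed, and the optimism bootstrap shown is the standard Efron-style recipe, not code from the answerer.

```python
# Repeated k-fold CV and an optimism-bootstrap internal validation, sketched
# on synthetic data with a ridge model as a stand-in for the real problem.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Repeated 10-fold CV (10 repeats here; the answer suggests ~100).
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
cv_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")

# Optimism bootstrap: estimate how much the apparent (resubstitution) R^2
# overstates true performance, then subtract that optimism.
model = Ridge(alpha=1.0).fit(X, y)
apparent = model.score(X, y)

rng = np.random.default_rng(0)
optimism = []
for _ in range(100):
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample
    m = Ridge(alpha=1.0).fit(X[idx], y[idx])
    # Apparent score on the resample minus score back on the original data.
    optimism.append(m.score(X[idx], y[idx]) - m.score(X, y))
corrected_r2 = apparent - float(np.mean(optimism))
```

Both procedures validate the *modeling process* on the full sample, so the final model fitted to all the data inherits the validation, rather than being an unvalidated refit of a model validated on a subset.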

I have described the problems with 'external' validation in more detail in BBR Chapter 10.
