Solved – Cross Validation versus Ensemble Learning

cross-validation, ensemble learning, model, regression

After performing $k$-fold cross validation to find the optimal model and/or hyperparameter choices, it is common to re-train your (best) proposed model on the full training set and quote this as your final model.

I am wondering whether there is any merit in instead using the previously trained $k$ distinct models as an ensemble of $k$ predictors, or alternatively in averaging over the $k$ distinct sets of model coefficients (in the case of linear regression).

Intuitively, this latter approach feels more Bayesian in flavour (averaging over model coefficients is like integrating over the space of coefficients; alternatively, the proposed ensemble method gives us $k$ different model predictions, so the prediction comes with an associated spread). The initial approach, by contrast, feels more frequentist.
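For concreteness, here is a minimal sketch of the two variants I have in mind, using scikit-learn's `LinearRegression` on synthetic data (the data and the 5-fold split are purely illustrative assumptions). Note that for a purely linear model the two variants coincide, since the prediction of the averaged coefficients equals the average of the fold predictions:

```python
# Sketch: ensemble of k fold-models vs. averaging their coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_new = X[:5]  # a few points to predict on (illustrative)

# Train one model per CV fold (k = 5)
fold_models = [
    LinearRegression().fit(X[train_idx], y[train_idx])
    for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X)
]

# Option 1: keep the k fold-models as an ensemble and average their predictions
ensemble_pred = np.mean([m.predict(X_new) for m in fold_models], axis=0)

# Option 2: average the k sets of coefficients into a single linear model
avg_coef = np.mean([m.coef_ for m in fold_models], axis=0)
avg_intercept = np.mean([m.intercept_ for m in fold_models])
coef_avg_pred = X_new @ avg_coef + avg_intercept

print(np.allclose(ensemble_pred, coef_avg_pred))  # True: identical for linear models
```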

I am just wondering, then, why people never seem to use this "Bayesian" approach (loose terminology). Are there any studies comparing these approaches?

Best Answer

Cross-validation isn't a training method but a method for evaluating the model.

Cross-validation for hyperparameter optimization

Typically, when performing CV you train the same model (with the exact same hyperparameters) $k$ times, each time on a slightly different training set. This is done to obtain a better, less biased estimate of the model's performance. Usually, CV is repeated a number of times (say $N$ times, once per candidate hyperparameter configuration) in order to select the best hyperparameters for the model.

Note that during this process none of the models has to be retrained on the full dataset, because our goal isn't actually to train a final model but to see how well the model does for given hyperparameters.
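As a concrete illustration, here is a minimal sketch of this selection loop, using scikit-learn's `Ridge` on synthetic data (the candidate alphas and the 5-fold split are illustrative assumptions). Each of the $N$ candidate settings costs $k$ fits, and none of the $k \cdot N$ fitted fold-models is kept afterwards:

```python
# Sketch: CV used purely to score hyperparameter candidates.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]  # N = 4 candidate settings
cv_scores = {
    alpha: cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()  # k = 5 fits each
    for alpha in candidate_alphas
}
best_alpha = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_alpha)
```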

Train the model one last time for predicting

After optimizing the model's hyperparameters (which requires training the different models $k \cdot N$ times), we want to train the model one final time (as you say, on the whole dataset). This is the point at which we could choose to use an ensemble of base models instead of training the single best model. However, CV doesn't come into play at this stage.
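Continuing the sketch above (and reusing `X`, `y` and `best_alpha` from it), the final step is simply a single refit on the full training set:

```python
from sklearn.linear_model import Ridge

# Refit the winning configuration once on the whole training set;
# only this single model is kept for making predictions.
final_model = Ridge(alpha=best_alpha).fit(X, y)
y_pred = final_model.predict(X[:5])  # in practice, predict on genuinely new data
```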

Practical problem with keeping the $k$ models from CV for creating an ensemble

What you are suggesting would require us to store the weights of all $k \cdot N$ models and use the $k$ models from the best configuration to create an ensemble (we wouldn't want to use any of the sub-optimal models at this point, because that would render the whole hyperparameter optimization meaningless). It would, however, be more practical to simply retrain the best model $k$ times at the end if we wanted to achieve this effect.
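To make the "retrain the best model $k$ times" alternative concrete, here is a minimal sketch (again reusing `X`, `y` and `best_alpha` from the earlier sketch): the winning configuration is refit on each of the $k$ folds and the resulting predictions are averaged:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Refit only the best configuration on each fold's training portion...
fold_models = [
    Ridge(alpha=best_alpha).fit(X[train_idx], y[train_idx])
    for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X)
]
# ...and average the k predictions to form the ensemble output.
ensemble_pred = np.mean([m.predict(X[:5]) for m in fold_models], axis=0)
```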

Even if we did...

You could use a technique similar to CV for creating an ensemble, through bagging (e.g. the way Random Forests are trained). However, because the $k$ models are exactly the same and are trained on very similar datasets, I feel there wouldn't be much point, as they would all predict very similar things. I'd like to stress at this point that the strength of CV isn't that you get $k$ models trained on slightly different datasets, but that each one is evaluated on a completely different test set (which makes it much harder to overfit the validation set). Instead, I feel it would be more meaningful to build an ensemble of somewhat different models, as in the sketch below. Keep in mind that boosting methods usually outperform bagging.
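As a rough illustration of these alternatives, here is a minimal sketch comparing a bagged ensemble of identical base models, an ensemble of genuinely different models, and a boosting model, all scored with cross-validation on synthetic data (the specific estimators and settings are purely illustrative assumptions, not a recommendation):

```python
# Sketch: bagging of identical models vs. an ensemble of different models vs. boosting.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

models = {
    "bagging (same base model)": BaggingRegressor(n_estimators=50, random_state=0),
    "ensemble of different models": VotingRegressor(
        [("ridge", Ridge()), ("tree", DecisionTreeRegressor(max_depth=5))]
    ),
    "boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```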

That being said, it would be perfectly viable to use a process similar to cross-validation for creating $k$ base estimators and ensembling them. However, as I said previously, in my opinion you have better options.

tl;dr

To sum up, cross-validation is an evaluation procedure, not one for training models! Ensembling, by contrast, isn't concerned with evaluating models but with training and combining several of them so that they achieve better results.
