K-fold cross-bagging

bagging, cross-validation, ensemble-learning, machine-learning, predictive-models

Typically when one does cross-validation, one fits $K$ models of the same form for each value on a grid of a hyperparameter. One selects the hyperparameter value that minimizes the out-of-sample prediction error averaged over the $K$ folds. With that estimated optimal hyperparameter, one re-fits one's model to the whole dataset, and that model is then used for prediction.
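For concreteness, this standard workflow might look roughly like the following scikit-learn sketch. The ridge estimator, the `alpha` grid, and the names `X`, `y`, `X_new` are placeholders of mine, not details from the question.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# 5-fold CV over a grid of alpha values; refit=True (the default) re-fits the
# best alpha on the whole dataset, and that refit model is used for prediction.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)                      # X, y: the full dataset (placeholders)
final_model = search.best_estimator_  # re-fit on all of the data
y_hat = final_model.predict(X_new)    # X_new: new data to predict (placeholder)
```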

This works well for regularized linear regression, because there is usually (always?) a minimum to the out-of-sample prediction error with respect to the hyperparameter.

But some algorithms don't have this guarantee, and/or they are just really noisy, and/or there are multiple hyperparameters, and/or the structure of the data means that some folds might be distributed differently than other folds.

For example, I'm working with panel data with multiple observations of individuals over years, and tuning my model to predict new years out-of-sample. So I've got $N$ observations and $n$ years, with $N \gg n$. Different years have fairly different realizations of the covariates, which leads to optimal hyperparameters being pretty different in different folds. Plus I'm fitting neural nets, which are really noisy and where convergence to a global minimum is impractical and/or ill-advised.

So lately I've been combining bagging with cross-validation. The algorithm is:

  1. Divide the data into $K$ folds
  2. For each fold $k$, fit the model over the grid of the hyperparameter to all of the data except the $k$th subset. Determine the optimal hyperparameter for that fold by predicting against the held-out $k$th subset
  3. Save $K$ different models, each with different hyperparameters
  4. At prediction time, $\hat y = \frac{1}{K}\sum_{k=1}^{K} m_k(X)$, where $m_k$ is the model selected for fold $k$.
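A minimal sketch of the above, using ridge regression as a stand-in for the neural nets and a random $K$-fold split as a stand-in for the by-year split; the function names are mine, not established terminology.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def fit_cross_bagged(X, y, alphas, K=5, seed=0):
    """Steps 1-3: return K models, each fit to the not-k data with its own fold-chosen alpha."""
    models = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        # Fit one candidate per grid value on the not-k subset.
        candidates = [Ridge(alpha=a).fit(X[train_idx], y[train_idx]) for a in alphas]
        # Choose the hyperparameter that predicts the held-out fold k best.
        errors = [mean_squared_error(y[test_idx], m.predict(X[test_idx])) for m in candidates]
        models.append(candidates[int(np.argmin(errors))])  # keep the winner for this fold
    return models

def predict_cross_bagged(models, X_new):
    """Step 4: the prediction is the average over the K fold-specific models."""
    return np.mean([m.predict(X_new) for m in models], axis=0)
```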

I get fairly good performance in my application, as measured by out-of-bag error.

The approach seems fairly intuitive, but I've invented it myself (i.e.: I'm not aware of others who have studied this approach). Is there reason to believe that it has drawbacks in certain settings that I should be cautious about? Or are there other approaches to this general class of problems?

Edit:
As frequently happens, a linked "related" question provides some insight:
A comment on this question links to this paper, which argues, in the context of very many bootstrap samples, that one should select the level of the hyperparameter using cross-validation and, given that hyperparameter, go back and re-do the bagging. It isn't immediately clear how well this would work in the context of only $K$ bootstrap samples (which are the same as the $K$ folds).
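If I read that correctly, the paper's recipe would look roughly like this sketch (ridge and the grid again being placeholders of mine):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Step 1: choose the hyperparameter once, by ordinary K-fold CV on the full data.
best_alpha = (GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
              .fit(X, y).best_params_["alpha"])

# Step 2: with that hyperparameter fixed, re-do the bagging over many bootstrap samples.
rng = np.random.default_rng(0)
bagged = []
for _ in range(100):
    b = rng.integers(0, len(X), size=len(X))  # bootstrap sample of the full data
    bagged.append(Ridge(alpha=best_alpha).fit(X[b], y[b]))

y_hat = np.mean([m.predict(X_new) for m in bagged], axis=0)  # X_new: data to predict
```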

Best Answer

You are not limited to $K$ bootstrap samples. You may create sampled subsets of your training set as many times as you wish; 100 times is generally considered enough. For every bag you get predictions for the test set, then you average them, and that is your output for the $k$th fold. You repeat the procedure $K$ times and average again to see the predictive power of your model: a bagging loop within a CV loop.
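As I understand that suggestion, a sketch might look as follows; ridge stands in for the actual model, and the function evaluates a single candidate hyperparameter value.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def bagged_cv_score(X, y, alpha, K=5, n_bags=100, seed=0):
    """Bagging loop inside each CV fold; returns the average fold error for one alpha."""
    rng = np.random.default_rng(seed)
    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        bag_preds = []
        for _ in range(n_bags):
            b = rng.choice(train_idx, size=len(train_idx), replace=True)  # one bag
            bag_preds.append(Ridge(alpha=alpha).fit(X[b], y[b]).predict(X[test_idx]))
        # Average the bagged predictions, then score them against the held-out fold k.
        fold_errors.append(mean_squared_error(y[test_idx], np.mean(bag_preds, axis=0)))
    return np.mean(fold_errors)  # average once more over the K folds
```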

The reason why we use CV is to reduce overfitting. If you split your data into train/val only once and then search for the best hyperparameters, you will likely overfit if your dataset is too small for the architecture you use. What I see you doing in your scheme is overfitting $K$ times and averaging these bad fits to get something new. I doubt it will be any better than... well, average.

What we do in CV is use the same architecture on different subsets and average the resulting models' predictions. The aim is to get a better idea about this particular architecture than we could from a single train/val split.

But since your data is very small, there is a need to decrease the test variance further. Hence bagging within every fold.

Update: What you may do is have two CV loops: one for hyperparameter tuning, and one for evaluating the resulting architecture.
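A sketch of that nested setup in scikit-learn, with ridge once more as a placeholder: the inner loop tunes the hyperparameter, the outer loop evaluates the tuned procedure on folds it never saw.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Inner loop: hyperparameter tuning within each outer training set.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Outer loop: estimate the performance of the whole tuning procedure.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=1),
                               scoring="neg_mean_squared_error")
```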