Solved – k-fold cross-validation of ensemble learning

classification, cross-validation, ensemble learning

I am confused about how to partition the data for k-fold cross-validation of ensemble learning.

Assume I have an ensemble learning framework for classification. My first layer contains the classification models, e.g. SVM and decision trees.

My second layer contains a voting model, which combines the predictions from the first layer and gives the final prediction.

If we use 5-fold cross-validation, I am thinking of using the 5 folds as follows:

  • 3 folds for training first layer
  • 1 fold for training second layer
  • 1 fold for testing

Is this the correct way? Should the training data for the first and second layers be independent? I think they should be, so that the ensemble learning framework will be robust.
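
For concreteness, here is a rough sketch of the split I have in mind (scikit-learn with placeholder data and models, just to make the partition explicit):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
    folds = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]

    first_idx = np.concatenate(folds[:3])   # 3 folds: train first layer
    second_idx = folds[3]                   # 1 fold: train second layer
    test_idx = folds[4]                     # 1 fold: test

    # First layer: fit the base classifiers on the first 3 folds.
    base_models = [SVC(probability=True), DecisionTreeClassifier(random_state=0)]
    for m in base_models:
        m.fit(X[first_idx], y[first_idx])

    # Second layer (here a logistic regression as the "voting" model), fit on
    # the first layer's predictions for the held-out second-layer fold.
    meta = np.column_stack([m.predict_proba(X[second_idx])[:, 1] for m in base_models])
    voter = LogisticRegression().fit(meta, y[second_idx])

    # Test the whole stack on the final fold.
    meta_test = np.column_stack([m.predict_proba(X[test_idx])[:, 1] for m in base_models])
    print("test accuracy:", voter.score(meta_test, y[test_idx]))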

My friend suggests that the training data for the first and second layers should be the same, i.e.

  • 4 folds for training first and second layer
  • 1 fold for testing

In this way, we will have a more accurate error estimate for the ensemble learning framework, and iterative tuning of the framework will be more accurate, as it is based on a single training set. Moreover, in my scheme the second layer may be biased towards its independent training fold.

Any advice is greatly appreciated.

Best Answer

Ensemble learning refers to quite a few different methods. Boosting and bagging are probably the two most common ones. It seems that you are attempting to implement an ensemble learning method called stacking. Stacking aims to improve accuracy by combining predictions from several learning algorithms. There are quite a few ways to do stacking and not a lot of rigorous theory. It's intuitive and popular though.

Consider your friend's approach. You are fitting the first layer models on four out of five folds and then fitting the second layer (voting) model using the same four folds. The problem is that the second layer will favor the model with the lowest training error. You are using the same data to fit models and to devise a procedure to aggregate those models. The second layer should combine the models using out-of-sample predictions. Your method is better, but there is a way to do better still.
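
As a toy illustration of the leak (synthetic data and scikit-learn, purely for the sake of example), compare a base model's in-sample predictions, which a fully grown tree can make look nearly perfect, with the out-of-fold predictions the second layer should actually see:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=500, random_state=0)
    tree = DecisionTreeClassifier(random_state=0)

    # In-sample predictions: the tree can essentially memorize the training data.
    in_sample = tree.fit(X, y).predict(X)

    # Out-of-fold predictions: every point is predicted by a model that never saw it.
    out_of_fold = cross_val_predict(tree, X, y, cv=4)

    print("in-sample accuracy:  ", accuracy_score(y, in_sample))
    print("out-of-fold accuracy:", accuracy_score(y, out_of_fold))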

We'll continue to leave out one fold for testing purposes. Take the four folds and use 4-fold CV to get out-of-sample predictions for each of your first layer models on all four folds. That is, leave out one of four folds and fit the models on the other three and then predict on the held-out data. Repeat for all four folds so you get out-of-sample predictions on all four folds. Then fit the second layer model on these out-of-sample predictions. Then fit the first layer models again on all four folds. Now you can go to the fifth fold that you haven't touched yet. Use the first layer models fit on all four folds along with the second layer model to estimate the error on the held-out data. You can repeat this process with each of the other folds held out of the first and second layer model fitting.
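
A minimal sketch of this procedure (scikit-learn; the data, base models, and the use of predicted probabilities as second-layer features are all placeholder choices):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold, cross_val_predict
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
    base_models = [SVC(probability=True), DecisionTreeClassifier(random_state=0)]
    outer = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = []
    for train_idx, test_idx in outer.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]

        # Inner 4-fold CV: out-of-sample predictions for each first layer model
        # on all four training folds.
        meta = np.column_stack([
            cross_val_predict(m, X_tr, y_tr, cv=4, method="predict_proba")[:, 1]
            for m in base_models
        ])

        # Fit the second layer on the out-of-sample predictions only.
        combiner = LogisticRegression().fit(meta, y_tr)

        # Refit the first layer models on all four training folds.
        for m in base_models:
            m.fit(X_tr, y_tr)

        # Evaluate the whole stack on the untouched fifth fold.
        meta_test = np.column_stack([m.predict_proba(X[test_idx])[:, 1] for m in base_models])
        scores.append(accuracy_score(y[test_idx], combiner.predict(meta_test)))

    print("CV estimate of the stacked model's accuracy:", np.mean(scores))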

If you are satisfied with the performance then generate out-of-sample predictions for the first layer models on all five folds and then fit the second layer model on these. Then fit the first layer models one last time on all your data and use these with the second layer model on any new data!
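
In the same spirit, a sketch of that final fit (again with placeholder data and models):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
    base_models = [SVC(probability=True), DecisionTreeClassifier(random_state=0)]

    # Out-of-sample predictions for every first layer model on all five folds.
    meta = np.column_stack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
        for m in base_models
    ])

    # Fit the second layer on those out-of-sample predictions.
    combiner = LogisticRegression().fit(meta, y)

    # Fit the first layer models one last time on all the data.
    for m in base_models:
        m.fit(X, y)

    def predict_stack(X_new):
        """First layer models first, then the second layer, on new data."""
        meta_new = np.column_stack([m.predict_proba(X_new)[:, 1] for m in base_models])
        return combiner.predict(meta_new)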

Finally, some general advice. You'll get more benefit if your first layer models are fairly distinct from each other, and you are on the right path here: SVM and decision trees are pretty different from each other. Since the second layer model has an averaging effect, you may want to let your first layer models overfit a bit, particularly if you have a lot of them. The second layer is generally something simple, and constraints such as non-negativity of the weights and monotonicity are common. Also remember that stacking relies on cross-validation, which is only an estimate of the true risk. If you get very different error rates and very different model weights across folds, your CV-based risk estimate has high variance. In that case, you may want to consider a simple blending of your first layer models, or you can compromise by stacking with constraints on the maximum/minimum weight placed on each first layer model.
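
For instance, a second layer constrained to non-negative weights could be sketched like this (using scipy's nnls on out-of-sample class probabilities; all of the specifics here are placeholder choices):

    import numpy as np
    from scipy.optimize import nnls
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
    base_models = [SVC(probability=True), DecisionTreeClassifier(random_state=0)]

    # Out-of-sample class-1 probabilities from each first layer model.
    P = np.column_stack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
        for m in base_models
    ])

    # Second layer as a non-negative least-squares fit of the probabilities
    # to the 0/1 labels, then normalized so the weights sum to 1.
    w, _ = nnls(P, y.astype(float))
    w = w / w.sum()
    print("non-negative stacking weights:", w)

    # The combined class-1 probability for new data is then P_new @ w,
    # thresholded at 0.5 for the final prediction.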