Solved – Stacking without splitting data

cross-validation, ensemble-learning, machine-learning, out-of-sample, stacking

I learned about stacking as used in ensemble learning. In stacking, the training data is split into two sets. The first set is used to train each base model (layer 1, left figure), and the second is used to train the combiner of their predictions (layer 2, right figure).

In my project, I have two different multi-class classification models, and I have a dataset (train/dev/test) that was used for training and testing both models.
When I learned about stacking, I thought I could use the whole training set to train the blender (layer 2) and then test the blender with the test data. However, the book and other websites I read mention that the training set is split into subsets.

Is it uncommon (or not recommended) to use the whole training set for both layer 1 and layer 2? I thought this would not be wrong, since the test data has already been set aside.
I have already trained my models on the whole training dataset. If that is not recommended, should I retrain them on a split of the training data?

[Figure: base-model training on the first subset (layer 1, left) and blender training on its predictions (layer 2, right)]

The images are taken from "Hands-On Machine Learning with Scikit-Learn and TensorFlow" (2017).

Best Answer

You need real, out-of-sample predictions as the input to your blender; otherwise the blender is not learning about, and thereby improving, prediction accuracy, but is instead learning about, and thereby improving, in-sample estimation accuracy, which can lead to overfitting. This is why you cannot use the whole training set for both layers: if you do, some of the "predictions" made by the base models will actually be in-sample estimates, not out-of-sample predictions.

You split your training data set so that subset 1 is used to train your base models; this is what is shown in your left picture. The base models are then used to generate predictions for subset 2, and these predictions, along with the actuals for subset 2, are given to your blender for training; this is what is shown in your right picture. Essentially, the predictions are features given to your blender, possibly along with other features from subset 2.
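To make the hold-out scheme concrete, here is a minimal sketch in Python with scikit-learn. The synthetic three-class dataset, the particular base models (a random forest and an SVM), and the logistic-regression blender are all illustrative assumptions, not something prescribed by the answer.

```python
# Minimal hold-out stacking sketch (assumed setup: synthetic 3-class data,
# a random forest and an SVM as base models, logistic regression as blender).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Split the training data: subset 1 trains the base models (layer 1),
# subset 2 is held out so the blender only sees out-of-sample predictions.
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X_train, y_train,
                                                  test_size=0.5,
                                                  random_state=0)

base_models = [RandomForestClassifier(random_state=0),
               SVC(probability=True, random_state=0)]
for model in base_models:
    model.fit(X_sub1, y_sub1)          # layer 1: fit on subset 1 only

# Layer 2: base-model predicted probabilities on subset 2 become the
# blender's features; the subset-2 actuals are its targets.
blend_features = np.hstack([m.predict_proba(X_sub2) for m in base_models])
blender = LogisticRegression(max_iter=1000)
blender.fit(blend_features, y_sub2)
```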

The model that the blender comes up with based on subset 2 is then used to predict the test data. This is done by predicting the test data with each of the base models (developed on subset 1), then feeding those predictions into the blender model (developed on subset 2) to get the final predictions. The resulting predictions are the ones you use for calibrating / testing the combined base models + blender model.
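Continuing the sketch above, the test data is scored by running the same two steps in sequence: base models first, then the blender on top of their predictions.

```python
# Score the test set with the full pipeline: base models first,
# then the blender applied to their predictions.
test_features = np.hstack([m.predict_proba(X_test) for m in base_models])
y_pred = blender.predict(test_features)
print("stacked test accuracy:", (y_pred == y_test).mean())
```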

Alternatively, you can re-train your base models on subsets 1 and 2 combined prior to making predictions for the test data set. This will tend to improve the base models' predictions of the test data, but (hopefully only slightly) weaken the link between the base models and the blender model, since the blender saw less-accurate predictions when it was being trained. The blender will consequently add less value and introduce more overfitting, but given that the base models are more accurate, it may balance out.
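A sketch of that variant, under the same assumptions as above: the base models are refit on all of the training data (subsets 1 and 2 together), while the blender trained on the held-out predictions is kept unchanged.

```python
# Variant: refit the base models on the whole training set (subsets 1 + 2)
# before predicting the test data; the blender itself is left as trained
# on the held-out (subset-2) predictions.
for model in base_models:
    model.fit(X_train, y_train)
test_features = np.hstack([m.predict_proba(X_test) for m in base_models])
y_pred_refit = blender.predict(test_features)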

ETA (from comments): In practice, I tend to split into more than two groups; I usually use 10. The base models are then trained on much more data, so they are more accurate (at least in situations where you don't have overwhelming amounts of data), and the blender is trained on predictions from models whose accuracy characteristics are closer to what it will see when it goes operational, which is a win-win in accuracy terms.
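With more than two groups, this amounts to generating out-of-fold predictions for every training row, for example with scikit-learn's cross_val_predict; the choice of that helper, and of 10 folds, is just one way to implement what is described above.

```python
# k-fold variant: every training row gets an out-of-fold prediction, so the
# blender trains on the full training set while no base model ever predicts
# rows it was fitted on.
from sklearn.model_selection import cross_val_predict

oof_features = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=10, method="predict_proba")
    for m in base_models
])
blender_cv = LogisticRegression(max_iter=1000)
blender_cv.fit(oof_features, y_train)

# For the test set, the base models are refit on the whole training set
# and the blender combines their predictions as before.
for model in base_models:
    model.fit(X_train, y_train)
test_features_cv = np.hstack([m.predict_proba(X_test) for m in base_models])
y_pred_cv = blender_cv.predict(test_features_cv)
```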