Scikit-Learn – Why Estimators in Stacking Are Fitted on Entire Training Data

Tags: ensemble-learning, scikit-learn, stacking

In chapter 7 of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow", the first step of the stacking method is splitting the training data into two subsets. The first subset is used to train the predictors in the first layer. Next, the first layer's predictors are used to make predictions on the second (held-out) subset, and a new training set is created using these predicted values as input features. This new training set is then used to train a blender.
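For concreteness, here is a minimal sketch of that hold-out ("blending") procedure written directly with scikit-learn; the dataset, the two first-layer models, and the 50/50 split are arbitrary illustrative choices, not the book's exact setup:

```python
# Minimal sketch of the hold-out (blending) approach described in the book.
# Dataset, models, and split ratio are arbitrary choices for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: split the training data into two subsets.
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42
)

# Step 2: train the first-layer predictors on the first subset only.
layer1 = [RandomForestClassifier(random_state=42),
          SVC(probability=True, random_state=42)]
for model in layer1:
    model.fit(X_sub1, y_sub1)

# Step 3: build a new training set from predictions on the held-out subset.
X_blend = np.column_stack([m.predict_proba(X_sub2)[:, 1] for m in layer1])

# Step 4: train the blender (meta-model) on these predictions.
blender = LogisticRegression()
blender.fit(X_blend, y_sub2)

# At prediction time, data goes through the first layer, then the blender.
X_test_blend = np.column_stack([m.predict_proba(X_test)[:, 1] for m in layer1])
print("Blender accuracy:", blender.score(X_test_blend, y_test))
```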


But the Stacked generalization section of the scikit-learn documentation says: "During training, the estimators are fitted on the whole training data X_train. They will be used when calling predict or predict_proba." So it seems that scikit-learn uses all the data to train the "first layer" predictors. Why does scikit-learn use this setting?

Best Answer

For an exact answer to the why question you would need to ask the scikit-learn developers about their motivation; we can only guess. In general, stacking means training several models on the data and having a meta-model that is trained on their predictions. Aurélien Géron mentions that a "common approach is to use a hold-out set", but it is not the only way. Another possible approach is to use cross-validation, and in fact this is what scikit-learn does.
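To make the contrast concrete, this is roughly what the cross-validation route looks like with scikit-learn's StackingClassifier; the base models and cv=5 below are arbitrary illustrative choices:

```python
# Rough sketch of stacking with cross-validation via StackingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-model is trained on out-of-fold (cross-validated) predictions
)

# fit() trains final_estimator on cross-validated predictions of the base
# estimators, then refits each base estimator on the whole of X so it can be
# used later in predict()/predict_proba().
stack.fit(X, y)
```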

Why did they decide to use cross-validation rather than two separate sets? Cross-validation re-uses the data, so it is better suited to smaller datasets. To use two separate sets, you would need a large dataset so that each subset is big enough to train a model on. Cross-validation therefore seems to be a safer default for general-purpose software, while, as Aurélien Géron notes, the other approach can easily be implemented yourself if you want to use it (see the sketch below).
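If you do want the book's hold-out behaviour while still using StackingClassifier, one possible way (assuming a scikit-learn release that supports cv="prefit", which I believe was added in version 1.1) is to prefit the first-layer models on one subset and fit the stack on the held-out subset:

```python
# Possible way to reproduce the hold-out approach with StackingClassifier,
# assuming a scikit-learn version that supports cv="prefit" (1.1+).
# Models and split are arbitrary choices for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, random_state=42)
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# First-layer predictors are fitted only on the first subset.
estimators = [
    ("rf", RandomForestClassifier(random_state=42).fit(X_sub1, y_sub1)),
    ("svc", SVC(random_state=42).fit(X_sub1, y_sub1)),
]

# With cv="prefit", StackingClassifier does not refit the base estimators;
# the final_estimator is trained on their predictions for the data passed to
# fit(), here the held-out second subset.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv="prefit",
)
stack.fit(X_sub2, y_sub2)
```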
