Model Stacking – Why Ensemble Learning Methods are Effective

Tags: ensemble-learning, machine-learning, stacking

Recently, I've become interested in model stacking as a form of ensemble learning. In particular, I've experimented a bit with some toy datasets for regression problems. I've basically implemented individual "level 0" regressors, stored each regressor's output predictions as a new feature, and then fit a "meta-regressor" on these new features (the level 0 predictions). I was extremely surprised to see even modest improvements over the individual regressors when testing the meta-regressor against a validation set.

So, here's my question: why is model stacking effective? Intuitively, I would expect the stacking model to perform poorly, since it appears to have an impoverished feature representation compared to each of the level 0 models. That is, if I train 3 level 0 regressors on a dataset with 20 features and use those regressors' predictions as input to my meta-regressor, then my meta-regressor has only 3 features to learn from. It just seems like there is more information encoded in the 20 original features the level 0 regressors train on than in the 3 output features the meta-regressor trains on.
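For reference, here is a minimal sketch of the kind of setup I mean, using scikit-learn on a synthetic dataset (the particular dataset, the three level 0 regressors, and the Ridge meta-regressor are just placeholders for illustration, not my exact code):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression problem with 20 original features.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Level 0 regressors, each fit on the original 20 features.
level0 = [
    RandomForestRegressor(n_estimators=200, random_state=0),
    GradientBoostingRegressor(random_state=0),
    KNeighborsRegressor(n_neighbors=10),
]
for model in level0:
    model.fit(X_train, y_train)

# Each level 0 model's predictions become one feature for the meta-regressor,
# so the meta-regressor sees only 3 input columns.
Z_train = np.column_stack([m.predict(X_train) for m in level0])
Z_val = np.column_stack([m.predict(X_val) for m in level0])

# (In practice, out-of-fold predictions, e.g. via cross_val_predict, are
# preferred here to avoid leaking the training targets into the meta-features.)
meta = Ridge()
meta.fit(Z_train, y_train)

for name, m in zip(["rf", "gbm", "knn"], level0):
    print(name, mean_squared_error(y_val, m.predict(X_val)))
print("stack", mean_squared_error(y_val, meta.predict(Z_val)))
```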

Best Answer

Think of ensembling as basically an exploitation of the central limit theorem.

The central limit theorem loosely says that, as the sample size increases, the sample mean becomes an increasingly accurate estimate of the population mean (assuming that's the statistic you're looking at), and the variance of that estimate tightens.
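Concretely, for $n$ independent draws $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$ (an idealization; the errors of real models are correlated, which weakens the effect):

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \mathbb{E}[\bar{X}_n] = \mu, \qquad \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}.
$$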

If you have one model and it produces one prediction for your dependent variable, that prediction will likely be somewhat too high or too low. But if you have 3 or 5 or 10 different models that produce different predictions, then, for any given observation, the high predictions from some models will tend to offset the low predictions from others, and the net effect is that the average (or some other combination) of the predictions converges towards the truth. Not on every observation, but that's the general tendency. And so, generally, an ensemble will outperform the best single model.
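As a sanity check of that intuition, here is a small simulation (the error scale and the assumption of independent errors are made up purely for illustration; real models' errors are correlated, which weakens but does not eliminate the cancellation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_models = 10_000, 5
truth = rng.normal(size=n_obs)

# Each "model" = truth + its own independent noise.
preds = truth[:, None] + rng.normal(scale=1.0, size=(n_obs, n_models))

# RMSE of each individual model vs. RMSE of the averaged prediction.
single_rmse = np.sqrt(((preds - truth[:, None]) ** 2).mean(axis=0))
ensemble_rmse = np.sqrt(((preds.mean(axis=1) - truth) ** 2).mean())

print("individual RMSEs:", np.round(single_rmse, 3))
print("ensemble RMSE:   ", round(float(ensemble_rmse), 3))
```

With independent errors of equal variance, the averaged prediction's RMSE shrinks by roughly a factor of $\sqrt{5}$ relative to any single model, in line with the variance formula above.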