Solved – Combining bagging and stacking, with and without clusters and heteroskedasticity

Tags: bagging, heteroscedasticity, model-averaging, random forest, stacking

Question 1:
Start with the classic case of bagging, say in a random forest. Fit $B$ trees to bootstrap samples of the data, then average the predictions of the $B$ trees to form a final prediction. Bagging.

Why not use stacking? Form a matrix of predictions of dimension $N\times B$ and regress $Y$ on it. This yields a set of weights $w$, which can be used to form a weighted average of the individual predictions. Stacking.
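
For concreteness, here is a minimal sketch of the two combination rules I have in mind (assuming a scikit-learn regression setting with synthetic data; the names `P` and `w` are just for illustration, and a real stacking fit would use out-of-fold rather than in-sample predictions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for (X, Y).
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
N, B = len(y), 25
rng = np.random.default_rng(0)

# Bagging: fit B trees to bootstrap samples of the data.
trees = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)                  # bootstrap sample of rows
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

P = np.column_stack([t.predict(X) for t in trees])    # N x B prediction matrix

bagged = P.mean(axis=1)                               # bagging: equal weights 1/B

# Stacking: regress Y on the prediction matrix to learn weights w.
w = LinearRegression(fit_intercept=False).fit(P, y).coef_
stacked = P @ w
```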

Why aren't random forests routinely "stacked?"

Question 2:
A more difficult case: say your dataset has some structure that prompts you to draw bootstrap samples at a higher level of clustering. For example, say you are sampling classrooms instead of individual students, so each tree has whole classrooms in its OOB sample. These can certainly be bagged. But can they be stacked (with good results)?

It seems like heteroskedasticity might mess up stacking in this multilevel context. Say each classroom has a different amount of unexplained (or unexplainable) variance. The trees that do best in the stacking regression will be those that didn't include the classes with big epsilons. So my final meta-model will pretend as if everyone has a small epsilon, thereby understating my uncertainty. Right? What is known on this subject?

One more question, because I am new to stacking:
What do you do with negative regression coefficients in the stacking regression? Negative weights? Or do you just exponentiate them and call it done? Or should one use some sort of non-negative least squares optimizer?
I get a lot of zero weights when I do the latter, and it seems like this'd screw up the variance reduction that you're supposed to get through ensembling.

Best Answer

Question 1: Why not use Stacking in Random Forests instead of averaging?

Decision trees have high variance, and averaging them together reduces that variance, improving performance. Since the bootstrapped trees in a random forest are individually weak and essentially exchangeable (each is fit to an i.i.d. bootstrap sample), a learned set of $B$ stacking weights has little to exploit beyond the simple $1/B$ average, so stacking does not work that well on them. Stacking is best suited for a diverse set of strong models, which can themselves be ensembles (e.g., Random Forests, GBMs, etc.).

Question 2: Can you stack clustered (aka "pooled repeated measures") data?

Sure, you can stack clustered data. However, when you use cross-validation to create the "level-one" data (the data used to train the metalearner), you should ensure that the rows belonging to a single cluster all stay within a single fold. In your example, this means the rows corresponding to a whole classroom must be contained in a single fold and not be spread out across different folds.
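
A minimal sketch of what that can look like, assuming scikit-learn and a `groups` vector holding one classroom id per row (the function name and base learners are placeholders, not a particular library's API):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

def level_one_data(base_learners, X, y, groups, n_splits=5):
    """Out-of-fold predictions with whole clusters kept inside single folds."""
    cv = GroupKFold(n_splits=n_splits)
    # GroupKFold never splits a group (classroom) across train and test.
    cols = [cross_val_predict(m, X, y, cv=cv, groups=groups) for m in base_learners]
    return np.column_stack(cols)

# Usage sketch: fit the metalearner on the level-one data.
# Z = level_one_data([RandomForestRegressor(), GradientBoostingRegressor()], X, y, groups)
# meta = LinearRegression(positive=True, fit_intercept=False).fit(Z, y)
```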

Question 3: What do you do with negative regression coefficients in the stacking regression?

There's nothing inherently wrong with allowing negative weights; however, I've consistently seen better results when the weights are restricted to be non-negative. That's why we choose a GLM with non-negative weights as the default metalearner in the H2O Stacked Ensemble implementation. It's also the default in the SuperLearner R package.

Having a lot of zero weights is not a problem, it probably just means that many of your base learners are not adding value to the ensemble.
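
For reference, here is a minimal sketch of the non-negative least squares option mentioned above (assuming SciPy, with `P` the matrix of cross-validated base-learner predictions and `y` the outcome; the rescaling to sum-to-one weights is a common convention, not a requirement):

```python
import numpy as np
from scipy.optimize import nnls

def nonnegative_stacking_weights(P, y):
    """Least-squares stacking weights constrained to be >= 0."""
    w, _ = nnls(P, y)           # many entries typically come out exactly zero
    if w.sum() > 0:
        w = w / w.sum()         # optional: normalize to a convex combination
    return w

# final_prediction = P_new @ nonnegative_stacking_weights(P, y)
```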
