Model Stacking – Why Ensemble Learning Methods are Effective

ensemble learning, machine learning, stacking

Recently, I've become interested in model stacking as a form of ensemble learning. In particular, I've experimented a bit with some toy datasets for regression problems. I've basically implemented individual "level 0" regressors, stored each regressor's output predictions as a new feature for a "meta-regressor" to take as its input, and fit this meta-regressor on these new features (the predictions from the level 0 regressors). I was extremely surprised to see even modest improvements over the individual regressors when testing the meta-regressor against a validation set.

So, here's my question: why is model stacking effective? Intuitively, I would expect the model doing the stacking to perform poorly since it appears to have an impoverished feature representation compared to each of the level 0 models. That is, if I train 3 level 0 regressors on a dataset with 20 features, and use these level 0 regressors' predictions as input to my meta-regressor, this means my meta-regressor has only 3 features to learn from. It just seems like there is more information encoded in the 20 original features that the level 0 regressors have for training than the 3 output features that the meta-regressor uses for training.

Best Answer

Think of ensembling as basically an exploitation of the central limit theorem.

The central limit theorem (together with the closely related law of large numbers) loosely says that, as the sample size increases, the sample mean becomes an increasingly accurate estimate of the population mean (assuming that's the statistic you're looking at), and the variance of that estimate shrinks.

If you have one model and it produces one prediction for your dependent variable, that prediction will likely be high or low to some degree. But if you have 3 or 5 or 10 different models that produce different predictions, then for any given observation the high predictions from some models will tend to offset the low predictions from others, provided the models' errors are not strongly correlated. The net effect is that the average (or other combination) of the predictions tends to move towards "the truth." Not on every observation, but in general that's the tendency. And so an ensemble will often outperform the best single model.
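One way to make the offsetting argument concrete is a short variance calculation. The assumptions here are mine rather than the answer's: each of M models has a prediction error ε_i with zero mean, a common variance σ², and a common pairwise correlation ρ. Under those assumptions the error of the simple average has variance

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{i=1}^{M}\varepsilon_i\right)
  = \frac{1}{M^2}\left(M\sigma^2 + M(M-1)\rho\sigma^2\right)
  = \rho\sigma^2 + \frac{1-\rho}{M}\sigma^2
```

With uncorrelated errors (ρ = 0) the variance falls as σ²/M, while with perfectly correlated errors (ρ = 1) averaging gains nothing. This is only a sketch of the variance side of the argument; a bias shared by all models is not averaged away.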

Related Solutions

Solved – How to properly do stacking/meta ensembling with cross validation

I think the only way to really determine this is to experiment. I made a small experiment here. I split the dataset in two and trained the base models on the first half. In the first variant I trained the stacking model on that same training data; in the second I trained it on the other half of the data. The accuracy of the second was slightly higher. However, this could be explained by the additional data that model gets to see. At the end of the day I think either method will work as long as the underlying models generalize well. It will also depend on how many observations there are to play with, training time, etc.

library(caret)

data("segmentationData")

segmentationData <- segmentationData[,c(-1,-2)]

inTrain = createDataPartition(segmentationData$Class, list = FALSE, p = 0.5)

x.train <- segmentationData[inTrain,]
x.lg <- segmentationData[-inTrain,]

fit.knn <- train(Class ~ ., x.train, method = "knn")
fit.svm <- train(Class ~ ., x.train, method = "svmRadial")

## Train Logistic Regression with same training data
e.train <- data.frame(knn = predict(fit.knn, x.train), svm = predict(fit.svm, x.train), Class = x.train$Class)
fit.lgA <- train(Class ~ ., e.train, method = "glm")

## Train Logistic Regression with different training data
e.train <- data.frame(knn = predict(fit.knn, x.lg), svm = predict(fit.svm, x.lg), Class = x.lg$Class)
fit.lgB <- train(Class ~ ., e.train, method = "glm")


resamps <- resamples(list(diff = fit.lgB, same = fit.lgA))

library(lattice)
bwplot(resamps)

> summary(resamps)

Call:
summary.resamples(object = resamps)

Models: diff, same 
Number of resamples: 25 

Accuracy 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
diff 0.7762  0.8142 0.8262 0.8249  0.8356 0.8753    0
same 0.7865  0.8037 0.8128 0.8148  0.8255 0.8538    0

Kappa 
       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
diff 0.5019   0.601 0.6273 0.6214  0.6466 0.7257    0
same 0.5380   0.575 0.5917 0.5955  0.6223 0.6776    0

Perhaps use this as a template for your own experiments :)

Solved – Combining bagging and stacking, with and without clusters and heteroskedasticity

Question 1: Why not use Stacking in Random Forests instead of averaging?

Decision trees have high variance and averaging them together reduces the variance, improving the performance. Since decision trees are weak individual models, stacking does not work that well on them. Stacking is best suited for a diverse set of strong models, which themselves can be ensembles (e.g. Random Forests, GBMs, etc).

Question 2: Can you stack clustered (aka "pooled repeated measures") data?

Sure, you can stack clustered data. However, when you use cross-validation to create the "level-one" data (the data used to train the metalearner), you should ensure that the rows belonging to a single cluster all stay within a single fold. In your example above, this means that the rows corresponding to a whole classroom must be contained in a single fold and not be spread out across different folds.
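The fold assignment described above can be sketched in base R. Everything in this example is a hypothetical illustration: the toy data, the two lm base learners, and all variable names are assumptions, not the asker's setup. The point is only that folds are drawn per cluster, so the level-one data consists of predictions for clusters the base learners never saw.

```r
set.seed(42)

# Hypothetical toy data: 12 clusters ("classrooms") with 10 rows each.
n_clusters <- 12
cluster <- rep(seq_len(n_clusters), each = 10)
n <- length(cluster)
x1 <- rnorm(n)
x2 <- rnorm(n)
cluster_effect <- rnorm(n_clusters)[cluster]
y <- 1 + 2 * x1 - x2 + cluster_effect + rnorm(n)
dat <- data.frame(y, x1, x2, cluster)

# Assign whole clusters to folds, so no cluster is split across folds.
k <- 4
fold_of_cluster <- sample(rep(seq_len(k), length.out = n_clusters))
fold <- fold_of_cluster[dat$cluster]

# Level-one data: out-of-fold predictions from two simple base learners.
level_one <- data.frame(m1 = numeric(n), m2 = numeric(n), y = dat$y)
for (f in seq_len(k)) {
  held_out <- fold == f
  fit1 <- lm(y ~ x1, data = dat[!held_out, ])
  fit2 <- lm(y ~ x2, data = dat[!held_out, ])
  level_one$m1[held_out] <- predict(fit1, newdata = dat[held_out, ])
  level_one$m2[held_out] <- predict(fit2, newdata = dat[held_out, ])
}

# Metalearner fit on predictions made for clusters unseen during training.
meta <- lm(y ~ m1 + m2, data = level_one)

# Sanity check: count the distinct folds each cluster appears in.
folds_per_cluster <- tapply(fold, dat$cluster, function(v) length(unique(v)))
```

If caret is already loaded, its groupKFold function may be a more convenient way to build such folds; the manual version is shown here only to keep the sketch dependency-free.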

Question 3: What do you do with negative regression coefficients in the stacking regression?

There's nothing inherently wrong with allowing negative weights; however, I've consistently seen better results when the weights are restricted to be non-negative. That's why we chose a GLM with non-negative weights as the default metalearner in the H2O Stacked Ensemble implementation. It's also the default in the SuperLearner R package.

Having a lot of zero weights is not a problem; it probably just means that many of your base learners are not adding value to the ensemble.
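As a rough illustration of a non-negative metalearner, the sketch below fits least-squares stacking weights with a lower bound of zero, using base R's optim. It is not the H2O or SuperLearner implementation; the simulated level-one matrix Z, the column names, and the helper functions sse and sse_grad are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical level-one data: three columns of base-learner predictions.
n <- 200
truth <- rnorm(n)
Z <- cbind(
  good  = truth + rnorm(n, sd = 0.3),
  fair  = truth + rnorm(n, sd = 0.6),
  noise = rnorm(n)
)

# Squared-error loss and its gradient for a weight vector w.
sse <- function(w) sum((truth - Z %*% w)^2)
sse_grad <- function(w) as.vector(-2 * t(Z) %*% (truth - Z %*% w))

# Box-constrained optimization: every weight is kept at or above zero.
start <- rep(1 / ncol(Z), ncol(Z))
fit <- optim(
  par = start,
  fn = sse,
  gr = sse_grad,
  method = "L-BFGS-B",
  lower = rep(0, ncol(Z))
)
weights <- setNames(fit$par, colnames(Z))
print(round(weights, 3))
```

A base learner that adds little beyond the others would be expected to receive a weight at or near zero, which matches the remark above that zero weights are not a problem in themselves.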
