Solved – Stacking/ensembling models with caret

caret, ensemble learning, r

I often find myself training several different predictive models using caret in R. I'll train them all on the same cross-validation folds, using caret::createFolds, then choose the best model based on cross-validated error.

However, the median prediction from several models often outperforms the best single model on an independent test set. I'm thinking of writing some functions for stacking/ensembling caret models that were trained with the same cross-validation folds, for example by taking median predictions from each model on each fold, or by training a "meta-model."
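To make the idea concrete, here is a minimal sketch of median ensembling over shared folds. It uses base R only rather than caret, and the `mtcars` data, the three `lm` candidates, and all names (`fitters`, `fold_id`, `oof`) are assumptions chosen for illustration, not part of any existing package.

```r
# Minimal sketch in base R: three models share the same folds, and their
# out-of-fold predictions are combined with a row-wise median.
set.seed(1)
dat <- mtcars
k <- 5

# Assign each row to one of k folds; every model reuses this assignment
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Hypothetical candidate models: each takes training data, returns a fitted lm
fitters <- list(
  linear    = function(d) lm(mpg ~ wt + hp, data = d),
  quadratic = function(d) lm(mpg ~ poly(wt, 2) + hp, data = d),
  weight    = function(d) lm(mpg ~ wt, data = d)
)

# Out-of-fold predictions: one row per observation, one column per model
oof <- matrix(NA_real_, nrow = nrow(dat), ncol = length(fitters),
              dimnames = list(NULL, names(fitters)))
for (f in seq_len(k)) {
  test_rows <- which(fold_id == f)
  for (m in names(fitters)) {
    fit <- fitters[[m]](dat[-test_rows, ])
    oof[test_rows, m] <- predict(fit, newdata = dat[test_rows, ])
  }
}

# Ensemble prediction: median across models for each observation
oof_median <- apply(oof, 1, median)

# Cross-validated RMSE of each model and of the median ensemble
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
cv_rmse <- c(apply(oof, 2, rmse, obs = dat$mpg),
             median = rmse(oof_median, dat$mpg))
print(round(cv_rmse, 3))
```

Because the median has no parameters to fit, its error on the shared folds can be compared directly with the individual models. A trained meta-model is different: it learns from the first-level predictions, so it needs its own held-out data to get an honest error estimate.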

Of course, this might require an outer cross-validation loop. Does anyone know of any existing packages/open source code for ensembling caret models (and possibly cross-validating those ensembles)?

Best Answer

It looks like Max Kuhn actually started working on a package for ensembling caret models, but hasn't had time to finish it yet. This is exactly what I was looking for. I hope the project gets finished one day!

edit: I wrote my own package to do this: caretEnsemble

Related Solutions

Solved – Train / Validate / Test sets in Caret

Give this a try (modify the details as needed)

library(caret)
library(mlbench)
data(Sonar)

## Create the resampling splits; each element holds training row indices
set.seed(1)
splits <- createFolds(Sonar$Class, returnTrain = TRUE)

## For each split, record the held-out row indices and observed classes
results <- lapply(splits,
                  function(x, dat) {
                    holdout <- (1:nrow(dat))[-unique(x)]
                    data.frame(index = holdout,
                               obs = dat$Class[holdout])
                  },
                  dat = Sonar)
mods <- vector(mode = "list", length = length(splits))

## foreach or lapply would do this faster
for (i in seq_along(splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  ## train() tunes the model by resampling within the training rows only
  mod <- train(Class ~ ., data = Sonar[in_train, ],
               method = "svmRadial",
               preProcess = c("center", "scale"),
               tuneLength = 8)
  ## Predict the rows that were held out of this split
  results[[i]]$pred <- predict(mod, Sonar[-in_train, ])
  mods[[i]] <- mod
}

## Accuracy and Kappa on each held-out set
lapply(results, defaultSummary)

Solved – How to evaluate stacking ensemble model vs. other models with 10-fold cross-validation

From what I have seen in Kaggle competitions, this is not exactly how it is done in practice, but it is quite close. Basically, they cross-validate the second-level model and, within each CV training set, run cross-validation again for the first-level models. This is close to what you have written, except that your i_A and i_B are drawn by CV.

An example of this use is here, in the out-of-fold predictions code part (but he only applies CV for the first level models).
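The nested scheme described above (an outer CV for the second-level model, an inner CV for the first-level models) can be sketched roughly as follows. This is a base R illustration under stated assumptions, not the code from the linked example: the `mtcars` data, the two `lm` first-level models, the `lm` meta-model, and the helper names `make_folds` and `oof_predictions` are all hypothetical.

```r
# Sketch of stacking evaluated with nested cross-validation (base R only).
set.seed(1)
dat <- mtcars

# Hypothetical helper: random assignment of n rows to k folds
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Hypothetical first-level models
fitters <- list(
  linear    = function(d) lm(mpg ~ wt + hp, data = d),
  quadratic = function(d) lm(mpg ~ poly(wt, 2), data = d)
)

# Inner CV: out-of-fold predictions of every first-level model on data d
oof_predictions <- function(d, k) {
  fold_id <- make_folds(nrow(d), k)
  out <- as.data.frame(matrix(NA_real_, nrow(d), length(fitters),
                              dimnames = list(NULL, names(fitters))))
  for (f in seq_len(k)) {
    held <- which(fold_id == f)
    for (m in names(fitters)) {
      out[held, m] <- predict(fitters[[m]](d[-held, ]), newdata = d[held, ])
    }
  }
  out
}

outer_k <- 4
inner_k <- 4
outer_id <- make_folds(nrow(dat), outer_k)
stack_pred <- rep(NA_real_, nrow(dat))

for (f in seq_len(outer_k)) {
  test_rows <- which(outer_id == f)
  train <- dat[-test_rows, ]
  test  <- dat[test_rows, ]

  # The meta-model is trained on out-of-fold first-level predictions
  meta_train <- oof_predictions(train, inner_k)
  meta_train$mpg <- train$mpg
  meta_fit <- lm(mpg ~ linear + quadratic, data = meta_train)

  # Refit first-level models on the whole outer training set,
  # then feed their predictions for the outer test fold to the meta-model
  meta_test <- as.data.frame(lapply(fitters, function(fit_fun) {
    predict(fit_fun(train), newdata = test)
  }))
  stack_pred[test_rows] <- predict(meta_fit, newdata = meta_test)
}

# Outer-CV error of the stacked model, comparable to any single model
# evaluated on the same outer folds
stack_rmse <- sqrt(mean((stack_pred - dat$mpg)^2))
print(stack_rmse)
```

The point of the inner loop is that the meta-model never sees a first-level prediction made on a row that model was trained on. The outer loop then scores the whole stacking procedure on rows that neither level has seen.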

Then, in this book, p500, they clearly describe how to combine stacking and cross validation. Here is the interesting part:

The steps are described here:
