Solved – Train / Validate / Test sets in Caret

caret, cross-validation, glmnet, machine-learning, r

I want to use caret to compare two different classification algorithms, for example SVM and elastic net.

I want to put aside some samples for a test set and then use the remaining samples for training the model, which involves tuning some parameters (like alpha and lambda for elastic net), for which I use cross-validation as well. But once I have fixed the model, I want to calculate its performance on the test set (which was not used for parameter tuning).

I want to do 10-fold cross-validation to calculate the performance of each method (SVM vs. elastic net). Note that this cross-validation should be an outer loop around the cross-validation used for parameter tuning within each method.

I was wondering if I can use Caret to do that.

In the pseudo code given by caret (http://caret.r-forge.r-project.org/training.html), the first loop (line 2) is over each parameter set and the next loop (line 3) is over each resampling iteration. Is it possible to add another, outer loop (line 0: for each resampling iteration) to this scheme?
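To make the structure explicit, this is roughly what I have in mind (a sketch, not caret's actual pseudo code; the line 0 loop is the proposed addition):

## line 0: for each outer resampling split             (performance estimation)
##   line 2: for each candidate parameter set
##     line 3: for each inner resampling iteration      (parameter tuning)
##       fit the model and evaluate it on the inner holdout
##   refit with the chosen parameters on the outer training set
##   predict the outer test set and record the performance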


EDIT: @topepo: yes, we have limited data, so we want to have multiple test sets: use 90% of the data for training (including parameter tuning), calculate the performance on the remaining 10%, and then repeat this 10 times with different partitionings of the data.
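Something along these lines, where each training split keeps roughly 90% of the samples (just a sketch of the partitioning; my_data and outcome are placeholders for the data frame and the class column):

library(caret)

set.seed(1)
## returnTrain = TRUE returns the ~90% training indices for each of the 10 folds;
## the rows left out of a fold form that iteration's test set
train_rows <- createFolds(my_data$outcome, k = 10, returnTrain = TRUE)
sapply(train_rows, length) / nrow(my_data)   # about 0.9 for every split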

Best Answer

Give this a try (modify the details as needed)

library(caret)
library(mlbench)
data(Sonar)

set.seed(1)
## 10 outer folds: each element holds the training-row indices, so the ~10%
## left out of each fold serves as that fold's test set
splits <- createFolds(Sonar$Class, returnTrain = TRUE)
results <- lapply(splits,
                  function(x, dat) {
                    ## rows not in the training indices form the holdout for this fold
                    holdout <- (1:nrow(dat))[-unique(x)]
                    data.frame(index = holdout,
                               obs = dat$Class[holdout])
                  },
                  dat = Sonar)
mods <- vector(mode = "list", length = length(splits))  ## keep each fold's fitted model

## foreach or lapply would do this faster
for(i in seq(along = splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  ## train() runs its own internal resampling over the tuning grid,
  ## which plays the role of the inner (parameter-tuning) loop
  mod <- train(Class ~ ., data = Sonar[in_train, ],
               method = "svmRadial",
               preProc = c("center", "scale"),
               tuneLength = 8)
  ## predict the outer holdout with the final, tuned model
  results[[i]]$pred <- predict(mod, Sonar[-in_train, ])
  mods[[i]] <- mod
}

## accuracy and Kappa on each outer holdout
lapply(results, defaultSummary)
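If you also want the elastic net results on the same outer splits, the same loop can be repeated with a different method. A sketch, reusing the splits and results objects from above (method = "glmnet" with tuneLength is one way to get an alpha/lambda grid; adjust as needed):

enet_results <- results   ## same holdout indices and observed classes
for(i in seq(along = splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  enet_mod <- train(Class ~ ., data = Sonar[in_train, ],
                    method = "glmnet",
                    preProc = c("center", "scale"),
                    tuneLength = 8)
  ## overwrite the pred column with this fold's glmnet predictions
  enet_results[[i]]$pred <- predict(enet_mod, Sonar[-in_train, ])
}

lapply(enet_results, defaultSummary)

Since both models are evaluated on identical holdouts, the per-fold accuracies can then be compared pair-wise.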