Solved – Train / Validate / Test sets in Caret

caret, cross-validation, glmnet, machine-learning, r

I want to use caret to compare two different classification algorithms, for example SVM and elastic net.

I want to put aside some samples for a test set and then use the remaining samples for training the model, which involves tuning some parameters (like alpha and lambda for elastic net), for which I use cross-validation as well. But once I have fixed the model, I want to calculate its performance on the test set (which was not used for parameter tuning).

I want to do 10-fold cross-validation to calculate the performance of each method (SVM vs. elastic net). Note that this cross-validation should be an outer loop around the cross-validation used for parameter tuning within each method.

I was wondering if I can use Caret to do that.

In the pseudo code given by caret (http://caret.r-forge.r-project.org/training.html), the first loop (line 2) is over each parameter set and the next loop (line 3) is over each resampling iteration. Is it possible to add another, outer loop (line 0: for each resampling iteration) to this scheme?
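To make the structure explicit, this is roughly what I have in mind (a sketch, not caret's actual pseudo code; the line 0 loop is the proposed addition):

## line 0: for each outer resampling split             (performance estimation)
##   line 2: for each candidate parameter set
##     line 3: for each inner resampling iteration      (parameter tuning)
##       fit the model and evaluate it on the inner holdout
##   refit with the chosen parameters on the outer training set
##   predict the outer test set and record the performance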


EDIT: @topepo: yes, we have limited data, so we want to have multiple test sets: use 90% of the data for training (including parameter tuning), calculate the performance on the remaining 10%, and then repeat this 10 times with different partitionings of the data.
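Something along these lines, where each training split keeps roughly 90% of the samples (just a sketch of the partitioning; my_data and outcome are placeholders for the data frame and the class column):

library(caret)

set.seed(1)
## returnTrain = TRUE returns the ~90% training indices for each of the 10 folds;
## the rows left out of a fold form that iteration's test set
train_rows <- createFolds(my_data$outcome, k = 10, returnTrain = TRUE)
sapply(train_rows, length) / nrow(my_data)   # about 0.9 for every split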

Best Answer

Give this a try (modify the details as needed)

library(caret)
library(mlbench)
data(Sonar)

set.seed(1)
## 10 outer folds: each element holds the training-row indices, so the ~10%
## left out of each fold serves as that fold's test set
splits <- createFolds(Sonar$Class, returnTrain = TRUE)
results <- lapply(splits,
                  function(x, dat) {
                    ## rows not in the training indices form the holdout for this fold
                    holdout <- (1:nrow(dat))[-unique(x)]
                    data.frame(index = holdout,
                               obs = dat$Class[holdout])
                  },
                  dat = Sonar)
mods <- vector(mode = "list", length = length(splits))  ## keep each fold's fitted model

## foreach or lapply would do this faster
for(i in seq(along = splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  ## train() runs its own internal resampling over the tuning grid,
  ## which plays the role of the inner (parameter-tuning) loop
  mod <- train(Class ~ ., data = Sonar[in_train, ],
               method = "svmRadial",
               preProc = c("center", "scale"),
               tuneLength = 8)
  ## predict the outer holdout with the final, tuned model
  results[[i]]$pred <- predict(mod, Sonar[-in_train, ])
  mods[[i]] <- mod
}

## accuracy and Kappa on each outer holdout
lapply(results, defaultSummary)
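If you also want the elastic net results on the same outer splits, the same loop can be repeated with a different method. A sketch, reusing the splits and results objects from above (method = "glmnet" with tuneLength is one way to get an alpha/lambda grid; adjust as needed):

enet_results <- results   ## same holdout indices and observed classes
for(i in seq(along = splits)) {
  in_train <- unique(splits[[i]])
  set.seed(2)
  enet_mod <- train(Class ~ ., data = Sonar[in_train, ],
                    method = "glmnet",
                    preProc = c("center", "scale"),
                    tuneLength = 8)
  ## overwrite the pred column with this fold's glmnet predictions
  enet_results[[i]]$pred <- predict(enet_mod, Sonar[-in_train, ])
}

lapply(enet_results, defaultSummary)

Since both models are evaluated on identical holdouts, the per-fold accuracies can then be compared pair-wise.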