Solved – How does cross-validation in train (caret) precisely work

Tags: caret, cross-validation, r, train

I have read quite a number of posts on the caret package and I am specifically interested in the train function. However, I am not completely sure whether I have correctly understood how the train function works.

To illustrate my current thoughts I have composed a quick example.

  • First, one specifies a parameter grid. Let's say I use the method gbm; accordingly, the parameter grid for my model could look like:

    grid <- expand.grid(.n.trees = seq(10, 50, 10),
                        .interaction.depth = seq(1, 4, 1),
                        .shrinkage = c(0.01, 0.001),
                        .n.minobsinnode = seq(5, 20, 5))
    
  • Subsequently, the control parameters for train are defined via trainControl. Since I would like to know whether my thoughts on cross-validation using train are correct, in this example I use the following:

    train_control <- trainControl(method = "cv", number = 10)
    
  • At last, the train function is executed (a self-contained version of these three snippets is sketched after this list). For example:

    fit <- train(x, y, method = "gbm", metric = "Kappa",
                 trControl = train_control, tuneGrid = grid)
    
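For reference, the three snippets above can be combined into a self-contained run, sketched below. The two-class subset of iris, set.seed, and verbose = FALSE are purely illustrative additions; any classification data set with a factor outcome would do. (Newer caret versions no longer require the leading period in the tuneGrid column names.)

    library(caret)
    library(gbm)   # method = "gbm" requires the gbm package

    ## illustrative two-class data: versicolor vs. virginica from iris
    iris2 <- droplevels(subset(iris, Species != "setosa"))
    x <- iris2[, 1:4]    # predictors
    y <- iris2$Species   # factor outcome, so Kappa is a valid metric

    grid <- expand.grid(n.trees = seq(10, 50, 10),
                        interaction.depth = seq(1, 4, 1),
                        shrinkage = c(0.01, 0.001),
                        n.minobsinnode = seq(5, 20, 5))

    train_control <- trainControl(method = "cv", number = 10)

    set.seed(1)          # reproducible fold assignment
    fit <- train(x, y, method = "gbm", metric = "Kappa",
                 trControl = train_control, tuneGrid = grid,
                 verbose = FALSE)   # verbose is passed through to gbm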

Now the way I presume that train works is the following:

  1. In the above example there are 160 (5 * 4 * 2 * 4) possible parameter combinations (see the sketch after this list)
  2. For each parameter combination train performs a 10-fold cross validation
  3. For each parameter combination and for each fold (of the 10 folds) the performance metric (Kappa in my example) is computed (in my example this implies that 1,600 Kappa values are computed)
  4. For each parameter combination the mean of the performance metric is computed over the 10 folds
  5. The parameter combination with the best mean performance metric is considered the best parameter set for the model
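
A small sketch (my own addition, assuming the fit and grid objects from above) of how these counts could be checked:

    nrow(grid)         # 160 candidate parameter combinations (5 * 4 * 2 * 4)
    nrow(fit$results)  # 160 rows: one mean Accuracy/Kappa per combination

    ## with trainControl(method = "cv", number = 10, returnResamp = "all")
    ## per-fold results are kept for every combination, so
    ## nrow(fit$resample) would then be 160 * 10 = 1600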

My question is simple, are my current thoughts correct?

Best Answer

Yes, you are correct. If you want to look at the details:

  • To inspect the results across the parameter grid and the finally chosen model, compare fit$results with fit$bestTune and fit$finalModel (with equal performance, the less complex model is chosen).
  • To inspect the performance of the final model's parametrization per partition and resample, look at fit$resample. Note that by changing the value of returnResamp in ?trainControl you can configure which results are kept there (e.g. if you also want to see them for parameter sets other than the finally selected one), but usually the default is fine.
  • To inspect the individual predictions made during CV, enable savePredictions = TRUE in ?trainControl and then look at fit$pred, e.g. via table(fit$pred$Resample). A combined sketch follows below.
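
Putting the three points together, a minimal inspection sketch (assuming the x, y, and grid objects from the question, refit with the two trainControl options mentioned above) could look like:

    ## keep per-fold results for all parameter sets and save held-out predictions
    train_control <- trainControl(method = "cv", number = 10,
                                  returnResamp = "all",
                                  savePredictions = TRUE)
    fit <- train(x, y, method = "gbm", metric = "Kappa",
                 trControl = train_control, tuneGrid = grid,
                 verbose = FALSE)

    fit$results               # mean performance per parameter combination
    fit$bestTune              # the selected parameter combination
    fit$finalModel            # gbm refit on all of the data with bestTune
    fit$resample              # per-fold performance for all combinations (returnResamp = "all")
    head(fit$pred)            # individual held-out predictions from CV
    table(fit$pred$Resample)  # number of saved predictions per fold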