Solved – way to return the standard error of cross-validation predictions using caret `train`

In the book Applied Predictive Modeling, Ch. 4, there is the following table:

The standard error here is used in the following graph, and in the "one-standard-error" method for finding the optimal cost parameter:

However, when I examine the results table of the train object, I only get the following table, which shows the standard deviation:

         sigma      C Accuracy     Kappa AccuracySD    KappaSD
1  0.008741401   0.25  0.75625 0.3921076 0.03865154 0.09231880
2  0.008741401   0.50  0.75775 0.3943975 0.03960020 0.09685408
3  0.008741401   1.00  0.76050 0.3916946 0.04019722 0.09880242
4  0.008741401   2.00  0.75900 0.3711735 0.03607320 0.09550139
5  0.008741401   4.00  0.76050 0.3694723 0.03556756 0.09306441
6  0.008741401   8.00  0.75025 0.3356897 0.03530931 0.09649264
7  0.008741401  16.00  0.73350 0.2796838 0.02906274 0.08092828
8  0.008741401  32.00  0.73450 0.2753388 0.03149749 0.09526107
9  0.008741401  64.00  0.73300 0.2680479 0.03338474 0.09668607
10 0.008741401 128.00  0.72725 0.2477570 0.03700265 0.10970571

I can divide AccuracySD by the square root of the number of resamples (folds times repeats), which I think is the right SE calculation, although the book does not state it explicitly. However, this is not easy to generalize when running many candidate CV methods. Is there any way to extract either n or the SE from the train object?
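Written out, the calculation described above is the usual standard error of a mean. This is a sketch of the conventional approximation, not a formula quoted from the book; the symbols k (number of folds) and r (number of repeats) are introduced here for illustration. It treats the resampled accuracies as independent, which is only approximately true because the folds share training data.

```latex
\mathrm{SE}\left(\overline{\text{Accuracy}}\right) \approx \frac{\text{AccuracySD}}{\sqrt{n}},
\qquad n = k \times r
```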

Best Answer

You can use everything stored in the train object. The number of folds and the number of repeats are available in train$control, and the accuracy figures are in train$results.

I have taken the example and code from the book and recreated the picture for 10-fold CV with the following code, using dplyr and ggplot2.

    library(caret)
    library(kernlab)
    library(dplyr)
    library(ggplot2)

    data(GermanCredit)

    # Remove near-zero-variance predictors, then drop one dummy level per
    # factor to avoid linear dependencies
    GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
    GermanCredit$CheckingAccountStatus.lt.0 <- NULL
    GermanCredit$SavingsAccountBonds.lt.100 <- NULL
    GermanCredit$EmploymentDuration.lt.1 <- NULL
    GermanCredit$EmploymentDuration.Unemployed <- NULL
    GermanCredit$Personal.Male.Married.Widowed <- NULL
    GermanCredit$Property.Unknown <- NULL
    GermanCredit$Housing.ForFree <- NULL

    # Split into training and test sets
    set.seed(100)
    inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
    GermanCreditTrain <- GermanCredit[ inTrain, ]
    GermanCreditTest  <- GermanCredit[-inTrain, ]

    # Estimate sigma for the radial kernel and build the tuning grid over cost
    set.seed(231)
    sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)
    svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))

    set.seed(1056)
    svmFit10CV <- train(Class ~ ., data = GermanCreditTrain,
                        method = "svmRadial",
                        preProc = c("center", "scale"),
                        tuneGrid = svmTuneGrid,
                        trControl = trainControl(method = "cv", number = 10))

    # Number of resamples = folds * repeats, both read from train$control.
    # Depending on the caret version, repeats can be NA for plain "cv";
    # treat that case as a single repeat.
    nRepeats <- svmFit10CV$control$repeats
    if (is.na(nRepeats)) nRepeats <- 1
    nResamples <- svmFit10CV$control$number * nRepeats

    # Create the graph from the train$control and train$results objects
    svmFit10CV$results %>%
      mutate(accuracySD_low  = Accuracy - 2 * (AccuracySD / sqrt(nResamples)),
             accuracySD_high = Accuracy + 2 * (AccuracySD / sqrt(nResamples))) %>%
      ggplot(aes(x = C)) +
      geom_line(aes(y = Accuracy)) +
      geom_point(aes(y = Accuracy)) +
      scale_x_log10() +   # correct spacing of the cost parameter
      ylim(0.65, 0.8) +   # set correct y-axis
      geom_errorbar(aes(ymin = accuracySD_low, ymax = accuracySD_high),
                    colour = "gray50", width = .1) +
      labs(title = "Estimates of prediction accuracy\nwith 2 SE error bars")
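Separately from the answer's own code, the standard-error arithmetic and the one-standard-error rule can be wrapped in small base-R helpers that work on any data frame shaped like train$results. This is a minimal sketch under stated assumptions: the function names `add_accuracy_se` and `pick_one_se` are hypothetical, the rows are assumed to be ordered from simplest to most complex model, and the value of 10 resamples is assumed for the demonstration rather than taken from the question.

```r
# Sketch: add a standard-error column to a data frame shaped like train$results.
# `results` needs Accuracy and AccuracySD columns; `n_resamples` is the number
# of held-out sets (folds * repeats).
add_accuracy_se <- function(results, n_resamples) {
  results$AccuracySE <- results$AccuracySD / sqrt(n_resamples)
  results
}

# One-standard-error rule for a metric that is maximized: among the candidates
# whose accuracy is within one SE of the best, take the first (simplest) one.
# Assumption: rows are ordered from simplest to most complex model
# (here, increasing cost C).
pick_one_se <- function(results) {
  best <- which.max(results$Accuracy)
  threshold <- results$Accuracy[best] - results$AccuracySE[best]
  results[which(results$Accuracy >= threshold)[1], ]
}

# First five rows of the results table from the question.
# n_resamples = 10 is an assumption made for this demonstration.
res <- data.frame(
  C          = c(0.25, 0.5, 1, 2, 4),
  Accuracy   = c(0.75625, 0.75775, 0.76050, 0.75900, 0.76050),
  AccuracySD = c(0.03865154, 0.03960020, 0.04019722, 0.03607320, 0.03556756)
)
res <- add_accuracy_se(res, n_resamples = 10)
chosen <- pick_one_se(res)
print(chosen)
```

With a real train object, the resample count could probably be read as `length(fit$control$index)`, which should hold one entry per resample regardless of the CV method; that is an assumption worth checking against your caret version. caret also ships its own `oneSE` selection function, which can be requested with `trainControl(selectionFunction = "oneSE")`.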

