Solved – Results from rfe function (caret) to compute average metrics – R

auc, caret, feature selection, r

I am fitting an SVM-RFE model with the rfe function from the caret package, but I am a bit confused about the results. My code is:

library(caret)

## Combine ROC/Sens/Spec (twoClassSummary) with Accuracy/Kappa (defaultSummary)
fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
svmFuncs <- caretFuncs
svmFuncs$summary <- fiveStats

set.seed(345)
FSctrl <- rfeControl(method = "repeatedcv",
                   repeats = 5,
                   verbose = TRUE,
                   functions = svmFuncs,
                   index = createMultiFolds(TrData[, 1], times = 5),
                   saveDetails = TRUE)

TRctrl <- trainControl(method = "LGOCV",
                       number = 50, p = 0.7,
                       savePredictions = TRUE,
                       classProbs = TRUE,
                       verboseIter = FALSE)

set.seed(921)
svmRFE_NG <- rfe(x = TrData[, 2:43],
               y = TrData[, 1],
               sizes = seq(1,42),
               metric = "ROC",
               rfeControl = FSctrl,
               ## Options to train()
               method = "svmLinear",
               tuneGrid = expand.grid(C = 10^(-2:2)),
               preProc = c("center", "scale"),
               ## Inner resampling process
               trControl = TRctrl)

I would like to compute some average metrics (ROC curve, AUC, sensitivity…) from the cross-validation (training) data, but I am not sure where to look:

svmRFE_NG$pred:

> head(svmRFE_NG$pred)
              pred    BREAST      LUNG    obs Variables    Resample rowIndex
predictions.1 LUNG 0.3075494 0.6924506   LUNG        42 Fold01.Rep1       33
predictions.2 LUNG 0.1106591 0.8893409   LUNG        42 Fold01.Rep1       37
predictions.3 LUNG 0.2504079 0.7495921 BREAST        42 Fold01.Rep1       41
predictions.4 LUNG 0.1174505 0.8825495   LUNG        42 Fold01.Rep1       44
predictions.5 LUNG 0.1238329 0.8761671 BREAST        42 Fold01.Rep1       46
predictions.6 LUNG 0.2917743 0.7082257   LUNG        41 Fold01.Rep1       33

or svmRFE_NG$fit$pred:

> head(svmRFE_NG$fit$pred)
    pred    obs    BREAST      LUNG rowIndex    C   Resample
1 BREAST BREAST 0.7434318 0.2565682        4 0.01 Resample01
2   LUNG   LUNG 0.2731751 0.7268249        6 0.01 Resample01
3   LUNG BREAST 0.4431675 0.5568325        8 0.01 Resample01
4 BREAST BREAST 0.8306861 0.1693139       11 0.01 Resample01
5 BREAST BREAST 0.8404291 0.1595709       15 0.01 Resample01
6   LUNG   LUNG 0.3936469 0.6063531       19 0.01 Resample01

As far as I know, the final model is stored in svmRFE_NG$fit. Should I use those results (filtered to C = the best tuning parameter), or should I work with the svmRFE_NG$pred results (filtered to Variables = the optimal size)?

Best Answer

Judging from the RFE examples in Max Kuhn's caret documentation, and from svmRFE_NG$resample and svmRFE_NG$pred$Resample (and their counterparts in svmRFE_NG$fit), I'd say it depends on which characteristics you want to look at.

svmRFE_NG itself seems to contain the cross-validation results for the different variable subsets, so it can be used for statistics about the feature selection (consider e.g. svmRFE_NG$variables too). Not all information seems to be preserved here though, such as the performance of a specific combination of variables, unless I overlooked it.
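One possible way to get averaged metrics from these outer resamples, as a minimal sketch: filter the held-out predictions to one subset size, compute the metrics per resample, then average them. The data frame `toy` below is a made-up stand-in with the same columns as svmRFE_NG$pred. `auc_rank` and `avg_metrics` are hypothetical helper names, not caret functions. BREAST is assumed to be the positive class, since it is the first factor level and that is what twoClassSummary treats as the event.

```r
## Rank-based (Mann-Whitney) AUC: probability that a random positive
## gets a higher score than a random negative, ties counted as 1/2.
auc_rank <- function(score, is_pos) {
  r <- rank(score)                       # average ranks handle ties
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

## Per-resample ROC/Sens/Spec for one subset size, then their means.
## A resample without any positive (or negative) case yields NaN.
avg_metrics <- function(preds, size, positive = "BREAST") {
  sub <- preds[preds$Variables == size, ]
  per_fold <- lapply(split(sub, sub$Resample, drop = TRUE), function(d) {
    is_pos <- d$obs == positive
    c(ROC  = auc_rank(d[[positive]], is_pos),
      Sens = mean(d$pred[is_pos] == positive),
      Spec = mean(d$pred[!is_pos] != positive))
  })
  per_fold <- do.call(rbind, per_fold)
  list(per_fold = per_fold, mean = colMeans(per_fold))
}

## Toy stand-in for svmRFE_NG$pred (values are invented for illustration).
toy <- data.frame(
  pred      = c("BREAST", "LUNG", "BREAST", "LUNG",
                "BREAST", "BREAST", "LUNG", "LUNG"),
  BREAST    = c(0.9, 0.4, 0.6, 0.2, 0.8, 0.7, 0.3, 0.1),
  obs       = c("BREAST", "BREAST", "LUNG", "LUNG",
                "BREAST", "LUNG", "BREAST", "LUNG"),
  Variables = 3,
  Resample  = rep(c("Fold1.Rep1", "Fold2.Rep1"), each = 4),
  stringsAsFactors = TRUE
)
toy$LUNG <- 1 - toy$BREAST

res <- avg_metrics(toy, size = 3)
print(res$per_fold)
print(res$mean)
```

With the real object this would presumably be called as avg_metrics(svmRFE_NG$pred, svmRFE_NG$optsize), assuming the columns shown in the question.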

In contrast, svmRFE_NG$fit seems to contain the cross-validation results for the different hyperparameters of the "final model" (the best-performing combination of features and hyperparameters). So those can be used for the more classic statistics about the final model you obtained from the whole process.
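For the final-model side, a minimal sketch of restricting the inner held-out predictions to the selected C and averaging over resamples. `fit_pred` and `best_tune` are invented stand-ins for svmRFE_NG$fit$pred and svmRFE_NG$fit$bestTune, and accuracy is used only to keep the example short. Keep in mind that these inner resamples are run after the variables have been chosen, so they do not account for the selection step itself.

```r
## Toy stand-ins (invented values) for svmRFE_NG$fit$pred and
## svmRFE_NG$fit$bestTune.
fit_pred <- data.frame(
  pred     = c("BREAST", "LUNG", "LUNG", "BREAST",
               "BREAST", "LUNG", "BREAST", "BREAST"),
  obs      = c("BREAST", "LUNG", "BREAST", "BREAST",
               "BREAST", "LUNG", "LUNG", "BREAST"),
  C        = rep(c(0.01, 1), each = 4),
  Resample = rep(c("Resample01", "Resample02"), times = 4),
  stringsAsFactors = FALSE
)
best_tune <- data.frame(C = 1)

## Keep only the held-out predictions made with the selected C.
best_pred <- merge(fit_pred, best_tune, by = "C")

## Accuracy per resample, then averaged over the resamples.
acc_by_resample <- tapply(best_pred$pred == best_pred$obs,
                          best_pred$Resample, mean)
mean_acc <- mean(acc_by_resample)

print(acc_by_resample)
print(mean_acc)
```

The same filtering step should carry over to other metrics (for example a per-resample AUC computed from the BREAST probability column) once the rows for the best C have been isolated.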