Solved – Feature selection using caret + repeatedcv

Tags: caret, cross-validation, feature selection, r

I am using caret and repeatedcv with repeats for feature selection. That is,

rfeControl(functions = svmFuncs, method = "repeatedcv", number = 10, repeats = 5,  
           rerank = TRUE, returnResamp = "all", saveDetails = FALSE, verbose = TRUE)

I am quite confused about how rfeControl splits the input data when repetition is used. In general, if I am not mistaken, the most unbiased way to assess the performance of the model is to:

  1. iteratively create two subsets (a training set and a test set)
  2. perform the validation on the training set only (i.e. cross-validation) and select the most significant predictors
  3. assess the performance on the held-out test set (a sketch of this procedure follows the list)
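For concreteness, here is a minimal sketch of that procedure in R, written independently of caret; `dat`, `select_features()`, and `fit_model()` are hypothetical placeholders for the data and for whatever selection and modelling steps are used:

    set.seed(1)
    n_outer <- 10
    acc <- numeric(n_outer)

    for (i in seq_len(n_outer)) {
      ## (1) split into a training set and a test set
      in_train  <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
      train_set <- dat[in_train, ]
      test_set  <- dat[-in_train, ]

      ## (2) do all feature selection / tuning on the training set only
      keep <- select_features(train_set)              # e.g. cross-validated RFE
      fit  <- fit_model(train_set[, c(keep, "y")])

      ## (3) assess performance on the untouched test set
      pred   <- predict(fit, newdata = test_set)
      acc[i] <- mean(pred == test_set$y)
    }

    mean(acc)   # performance estimate averaged over the outer splits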

In the case of rfeControl with repeated CV, is the repetition applied at the outer split in step (1), or during the validation process in step (2)?

Best Answer

Nobody ever reads the documentation :-/

The package vignette for feature selection has all the details. It can now be found at:

http://caret.r-forge.r-project.org/featureselection.html

in Algorithm #2.

In your case, you have inner resampling to tune the SVM at each iteration (line 2.9 of Algorithm 2) and an external resampling loop to evaluate the number of predictors (line 2.1).
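As a sketch of how those two loops are typically specified (this assumes caretFuncs with an SVM fit via train(), rather than your svmFuncs; the objects predictors, outcome, and the sizes are placeholders, not taken from the question):

    library(caret)

    ## outer resampling: evaluates each candidate subset size (Algorithm 2, line 2.1)
    ctrl <- rfeControl(functions = caretFuncs,
                       method = "repeatedcv", number = 10, repeats = 5,
                       returnResamp = "all", verbose = TRUE)

    set.seed(123)
    svm_rfe <- rfe(x = predictors, y = outcome,
                   sizes = c(5, 10, 20),
                   rfeControl = ctrl,
                   ## everything below is passed to train(), i.e. the inner
                   ## resampling that tunes the SVM at each iteration (line 2.9)
                   method = "svmRadial",
                   tuneLength = 5,
                   trControl = trainControl(method = "cv", number = 5))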

Why does it do this? With small to moderate numbers of instances, a simple partition into a single test set does a very poor job of estimating performance and may very well over-fit to the predictors. [1] summarizes this point concisely: "hold-out samples of tolerable size [...] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate".

I would advise reading [2], which illustrates how difficult validating feature selection can be. If you have a lot of data, a single test set may be sufficient.

One other note: you don't show exactly what svmFuncs is, so I don't know how you are estimating variable importance. If you are using the default method, it analyzes each predictor independently, so rerank = TRUE is a waste of time (i.e. the values will be the same at each recalculation).
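For illustration only (your actual svmFuncs is not shown): a hypothetical svmFuncs built from caretFuncs with a filter-style rank function. Because filterVarImp() scores each predictor on its own, re-ranking at every subset size would simply reproduce the same scores:

    library(caret)

    svmFuncs <- caretFuncs                     # hypothetical starting point
    svmFuncs$rank <- function(object, x, y) {
      ## univariate (model-free) importance: each predictor scored independently
      imp <- filterVarImp(x, y)
      out <- data.frame(Overall = apply(imp, 1, max),
                        var     = rownames(imp))
      out[order(out$Overall, decreasing = TRUE), ]
    }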

Max

[1] Hawkins, D. M., Basak, S. C., & Mills, D. (2003). Assessing Model Fit by Cross-Validation. Journal of Chemical Information and Modeling, 43(2), 579–586. doi:10.1021/ci025626i

[2] Ambroise, C., & McLachlan, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562–6566.