Solved – Feature selection + classification in Caret

caret, machine learning, r

I'm using Caret to apply a bunch of different machine learning algorithms for phenotype prediction from gene expression data. With about 20,000 genes, I'd like to perform filter-based feature selection before training my classifiers. How can this be achieved? I've been through all the Caret documentation and vignettes, but the sbf and rfe feature selection functions seem to have the classification algorithms built in. For example,

output <- sbf(x, y, sbfControl = sbfControl(functions = rfSBF, method = "repeatedcv", repeats = 5))

gets me a list of features for each fold and repeat, optimized according to the RF classifier accuracy.

Ideally, I'd like to pass the features selected in each fold, along with the indices used, into the "train" function, so that I'm only training on a small feature subset for each fold and repeat. The steps I want (roughly sketched in code after this list) are:

  1. Split data into 10 folds
  2. Select features using 9 folds
  3. Train model on the same 9 folds and feature subset from step 2
  4. Evaluate performance on remaining fold
  5. Repeat 2-4 over all folds
  6. Repeat 1-5 with different fold splits
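
Written out by hand, steps 1-5 would look roughly like the sketch below, which is exactly the bookkeeping I'd rather not re-implement myself. Here x is my samples-by-genes matrix and y the phenotype factor; the 0.05 cutoff, the rf method and the single mtry value are just placeholders (rf needs the randomForest package):

library(caret)

set.seed(42)
## step 1: fix the outer fold split up front
outer_folds <- createFolds(y, k = 10, returnTrain = TRUE)

fold_results <- lapply(outer_folds, function(train_idx) {
  x_train <- x[train_idx, ]
  y_train <- y[train_idx]

  ## step 2: univariate filter computed on the 9 training folds only
  pvals <- apply(x_train, 2, function(gene) oneway.test(gene ~ y_train)$p.value)
  keep  <- which(pvals < 0.05)   # arbitrary cutoff, just for illustration

  ## step 3: train on the same 9 folds, restricted to the selected genes
  fit <- train(x_train[, keep, drop = FALSE], y_train,
               method    = "rf",
               tuneGrid  = data.frame(mtry = floor(sqrt(length(keep)))),
               trControl = trainControl(method = "none"))

  ## step 4: evaluate on the held-out fold
  postResample(predict(fit, newdata = x[-train_idx, keep, drop = FALSE]),
               y[-train_idx])
})
## steps 5-6: lapply covers all folds; rerun with new seeds for the repeats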

The "train" function takes care of everything but step 2 internally, so I can't figure out how to insert the feature selection using the same fold split. I assure you I've been through all the documentation I can find, so any other help would be much appreciated!

Best Answer

You should be able to accomplish everything you want with the sbf function on its own. I originally assumed it worked the way you describe, but the functionality sbf provides is more like a superset of what's available in train.

For example, something like this sounds like what you're getting at:

fit <- sbf(
  form = response ~ .,
  data = d,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0.01, lambda = 0.1),
  preProc = c("center", "scale"),
  trControl = trainControl(method = "none"),
  sbfControl = sbfControl(functions = caretSBF, method = "cv", number = 10)
)

This would run 10 outer folds and fit a single glmnet model to each one, using only the features that pass the filter in that fold. You could also give trControl some number of CV folds and pass a multi-row tuning grid to do the model tuning on inner folds.
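
For instance, something along these lines (the grid values are just illustrative) would tune glmnet with 5 inner CV folds inside each of the 10 outer sbf folds:

fit_nested <- sbf(
  form = response ~ .,
  data = d,
  method = "glmnet",
  preProc = c("center", "scale"),
  tuneGrid = expand.grid(alpha = c(0.01, 0.1, 1), lambda = c(0.01, 0.1)),
  trControl = trainControl(method = "cv", number = 5),
  sbfControl = sbfControl(functions = caretSBF, method = "cv", number = 10)
)

Each outer fold then gets its own filtered feature set and its own tuned alpha/lambda pair.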