Solved – Caret: customizing feature selection, nested inside cross validation

Tags: caret, cross-validation, feature selection

Using caret, I want to train an SVM classifier and estimate its performance with repeated cross-validation. My dataset has a very large number of predictors (300K), and I want to reduce this number with a very simple univariate approach, such as keeping predictors whose t-test p-value is below a threshold (a two-class ANOVA would also be fine). If I want to customize the filter threshold to keep only very significant predictors, I believe this works for me:

require(caret)

simdata <- twoClassSim(n = 100, linearVars = 300000)

## start from lmSBF and change only the filter: keep predictors whose
## score (a p-value) falls below 10e-6
mySBF <- lmSBF
mySBF$filter <- function(score, x, y) { score <= 10e-6 }

fit <- sbf(
  form = Class ~ .,
  data = simdata, 
  method = "svmLinear",
  sbfControl = sbfControl(
    functions = mySBF,
    method = 'repeatedcv',
    number = 4, 
    repeats = 10      
  )
)

But what if my strategy is to rank the predictors by p-value and simply take the top 100? Can anyone suggest a way to accomplish this? I don't see an obvious way to do it, since the functions in sbf appear to be applied one predictor at a time.

(I may not be using the twoClassSim function correctly; I'm just trying to provide a reproducible example.)

Thanks

Best Answer

A few things here:

  • lmSBF is for linear regression. twoClassSim simulates classification data, and you wouldn't want to use a linear regression model for that.
  • If you want to fit a linear SVM model with method = "svmLinear", you'll need to use caretSBF or write your own fit function. You should give the caret documentation page on feature selection by filtering a good read, since a lot of the information that you want is there.
  • For SVM classification models, the default ranking of the predictors uses an ANOVA model (see the documentation mentioned above). That means that smaller scores are better. You can use a filter function that is TRUE for the 10 smallest scores.

The code below probably does what you want. I didn't tune the model over the cost value but you could if needed.

require(caret)

## For speed, I added 300 informative predictors
set.seed(1)
simdata <- twoClassSim(n = 100, linearVars = 300)

## start from caretSBF and change the filter to keep the 10 predictors
## with the smallest (most significant) scores
mySBF <- caretSBF
mySBF$filter <- function(score, x, y) rank(score) <= 10

set.seed(2)
fit <- sbf(form = Class ~ .,
           data = simdata, 
           method = "svmLinear",
           trControl = trainControl(method = "none", 
                                    classProbs = TRUE),
           tuneGrid = data.frame(C = 0.25),
           preProc = c("center", "scale"),
           sbfControl = sbfControl(functions = mySBF,
                                   method = 'repeatedcv',
                                   number = 4, 
                                   repeats = 10))
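
If you literally want the top 100 predictors ranked by a t-test p-value, as asked above, the same pattern should work with a custom score as well as a custom filter. Here is a minimal, untested sketch along those lines; the names mySBF100 and fit100 and the seed are arbitrary, the Welch t-test is one choice among several, and it assumes the default behaviour where the score function sees one predictor at a time (as noted in the question).

mySBF100 <- caretSBF
mySBF100$score <- function(x, y) {
  ## x is a single numeric predictor here, y is the two-level class factor;
  ## use the Welch t-test p-value as the score
  t.test(x ~ y)$p.value
}
mySBF100$filter <- function(score, x, y) {
  ## 'score' holds the p-values for all predictors; keep the 100 smallest
  rank(score, ties.method = "first") <= 100
}

set.seed(3)
fit100 <- sbf(form = Class ~ .,
              data = simdata, 
              method = "svmLinear",
              trControl = trainControl(method = "none", 
                                       classProbs = TRUE),
              tuneGrid = data.frame(C = 0.25),
              preProc = c("center", "scale"),
              sbfControl = sbfControl(functions = mySBF100,
                                      method = 'repeatedcv',
                                      number = 4, 
                                      repeats = 10))

If I remember the accessors correctly, predictors(fit100) should list the variables retained for the final model. And if you do want to tune over the cost value, you can swap trainControl(method = "none") for a resampling method and give tuneGrid several values of C.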

Max