Solved – Caret: customizing feature selection, nested inside cross validation

Tags: caret, cross-validation, feature selection

Using caret, I want to train an SVM classifier and estimate its performance with repeated cross-validation. My dataset has a very large number of predictors (300K), and I want to reduce this number with a very simple univariate approach, such as keeping predictors whose t-test p-value is below a threshold (a two-class ANOVA would also be fine). If I want to customize the filter threshold to keep only very significant predictors, I believe this works for me:

require(caret)

simdata <- twoClassSim(n = 100, linearVars = 300000)

## start from lmSBF and change only the filter: keep predictors whose
## score (a p-value) falls below 10e-6
mySBF <- lmSBF
mySBF$filter <- function(score, x, y) { score <= 10e-6 }

fit <- sbf(
  form = Class ~ .,
  data = simdata, 
  method = "svmLinear",
  sbfControl = sbfControl(
    functions = mySBF,
    method = 'repeatedcv',
    number = 4, 
    repeats = 10      
  )
)

But what if my strategy is to rank the predictors by p-value and simply take the top 100? Can anyone suggest a way to accomplish this? I don't see an obvious way to do it, since the functions in sbf appear to be applied one predictor at a time.

(I may not be using the twoClassSim function correctly; I'm just trying to provide a reproducible example.)

Thanks

Best Answer

A few things here:

  • lmSBF is for linear regression. twoClassSim simulates classification data, and you wouldn't want to use a linear regression model for that.
  • If you want to fit a linear SVM model with method = "svmLinear", you'll need to use caretSBF or write your own fit function. You should give the caret documentation page on feature selection by filtering a good read, since a lot of the information that you want is there.
  • For SVM classification models, the default ranking of the predictors uses an ANOVA model (see the documentation mentioned above). That means that smaller scores are better. You can use a filter function that is TRUE for the 10 smallest scores.

The code below probably does what you want. I didn't tune the model over the cost value but you could if needed.

require(caret)

## For speed, I added 300 informative predictors
set.seed(1)
simdata <- twoClassSim(n = 100, linearVars = 300)

## start from caretSBF and change the filter to keep the 10 predictors
## with the smallest (most significant) scores
mySBF <- caretSBF
mySBF$filter <- function(score, x, y) rank(score) <= 10

set.seed(2)
fit <- sbf(form = Class ~ .,
           data = simdata, 
           method = "svmLinear",
           trControl = trainControl(method = "none", 
                                    classProbs = TRUE),
           tuneGrid = data.frame(C = 0.25),
           preProc = c("center", "scale"),
           sbfControl = sbfControl(functions = mySBF,
                                   method = 'repeatedcv',
                                   number = 4, 
                                   repeats = 10))
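
If you literally want the top 100 predictors ranked by a t-test p-value, as asked above, the same pattern should work with a custom score as well as a custom filter. Here is a minimal, untested sketch along those lines; the names mySBF100 and fit100 and the seed are arbitrary, the Welch t-test is one choice among several, and it assumes the default behaviour where the score function sees one predictor at a time (as noted in the question).

mySBF100 <- caretSBF
mySBF100$score <- function(x, y) {
  ## x is a single numeric predictor here, y is the two-level class factor;
  ## use the Welch t-test p-value as the score
  t.test(x ~ y)$p.value
}
mySBF100$filter <- function(score, x, y) {
  ## 'score' holds the p-values for all predictors; keep the 100 smallest
  rank(score, ties.method = "first") <= 100
}

set.seed(3)
fit100 <- sbf(form = Class ~ .,
              data = simdata, 
              method = "svmLinear",
              trControl = trainControl(method = "none", 
                                       classProbs = TRUE),
              tuneGrid = data.frame(C = 0.25),
              preProc = c("center", "scale"),
              sbfControl = sbfControl(functions = mySBF100,
                                      method = 'repeatedcv',
                                      number = 4, 
                                      repeats = 10))

If I remember the accessors correctly, predictors(fit100) should list the variables retained for the final model. And if you do want to tune over the cost value, you can swap trainControl(method = "none") for a resampling method and give tuneGrid several values of C.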

Max