Solved – Recursive feature selection with cross-validation in the caret package (R)

Tags: caret, cross-validation, feature selection, r

The rfe function in the caret package makes it possible to perform recursive feature elimination (backward selection) with cross-validation.
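For reference, a minimal sketch of such a call looks something like this, using caret's built-in random-forest helper functions (rfFuncs); here x (a predictor data frame) and y (an outcome vector) are placeholders, not objects from the original example:

library(caret)

# Minimal sketch: backward recursive feature elimination with 10-fold CV.
# 'x' and 'y' are placeholder predictor/outcome objects.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
profile <- rfe(x, y, sizes = c(1:5, 10), rfeControl = ctrl)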

It is expected that the best features selected in each fold may differ, as is also stated on the caret webpage:

Another complication to using resampling is that multiple lists of the "best" predictors are generated at each iteration. At first this may seem like a disadvantage, but it does provide a more probabilistic assessment of predictor importance than a ranking based on a single fixed data set. At the end of the algorithm, a consensus ranking can be used to determine the best predictors to retain.

However, it is not clear to me how the final "best" set of predictors is chosen by rfe, given this expected heterogeneity among folds. I cannot find a description of the "consensus ranking" procedure mentioned above.

Thank you for your help!

Best Answer

My understanding is that the "consensus ranking" is independent of choosing the "best" set of predictors. The rfe function finds the best predictors, but as far as I know the only way to see the actual algorithm is to go through the source code. I think the author is implying that building a "consensus ranking" from the per-resample variable lists is left to the user. For example, running the code example at Feature selection: Using the caret package and showing the results of the random forest predictors:

profile.1$results

  Variables  Accuracy     Kappa  AccuracySD    KappaSD
1         1 0.9968370 0.9936464 0.007392163 0.01485547
2         2 0.9968746 0.9937256 0.009326189 0.01866587
3         3 0.9963217 0.9926185 0.009537048 0.01908711
4         4 0.9971857 0.9943537 0.006409197 0.01284846
5         5 0.9968659 0.9937105 0.007209709 0.01445173
6         6 0.9977209 0.9954207 0.006048051 0.01213925
7        20 0.9954924 0.9909603 0.009642686 0.01930148

profile.2$results

 Variables  Accuracy     Kappa AccuracySD    KappaSD
1         1 0.6483312 0.2995335 0.04698551 0.09230506
2         2 0.7723877 0.5454866 0.03916581 0.07729696
3         3 0.8274992 0.6532635 0.04604503 0.09299738
4         4 0.8388603 0.6762275 0.04361517 0.08828418
5         5 0.8309978 0.6605690 0.04846354 0.09755719
6         6 0.8242424 0.6474883 0.04556598 0.09109094
7        20 0.8005472 0.6018126 0.04871103 0.09703959

profile.3$results

 Variables  Accuracy      Kappa AccuracySD    KappaSD
1         1 0.3192818 0.05197699 0.05773080 0.07663863
2         2 0.3933106 0.13560101 0.05459624 0.07598374
3         3 0.4594806 0.22122750 0.05119101 0.06953943
4         4 0.6771564 0.53076000 0.12127578 0.17285038
5         5 0.6536151 0.49190799 0.07879014 0.11242260
6         6 0.6070402 0.42205418 0.07241226 0.10155747
7        20 0.5046387 0.25116903 0.05869522 0.07952462

profile.4$results

  Variables  Accuracy       Kappa AccuracySD    KappaSD
1         1 0.5154641 0.036353403 0.05806695 0.11057134
2         2 0.5117129 0.032926630 0.06592773 0.12742427
3         3 0.5198731 0.046944007 0.04739288 0.09231161
4         4 0.5187570 0.045917813 0.05237265 0.10100463
5         5 0.5118155 0.032686407 0.05595381 0.10829322
6         6 0.5105693 0.032829544 0.05683679 0.10436906
7        20 0.4972180 0.007899334 0.04944846 0.08724467

A consensus could be calculated across the four results using accuracy or some combination of metrics.
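As a minimal sketch of what such a consensus might look like, assuming the four profiles share the same candidate subset sizes (as in the output above) and that each rfe object stores its per-resample importances in the $variables slot with columns var and Overall (the default for rfFuncs):

profiles <- list(profile.1, profile.2, profile.3, profile.4)

# 1. Consensus on subset size: average Accuracy for each number of variables.
acc <- sapply(profiles, function(p) p$results$Accuracy)
data.frame(Variables    = profiles[[1]]$results$Variables,
           MeanAccuracy = rowMeans(acc))

# 2. Consensus ranking of individual predictors: average the per-resample
#    importances stored in each rfe object's $variables slot.
imp <- do.call(rbind, lapply(profiles, function(p) p$variables))
ranking <- aggregate(Overall ~ var, data = imp, FUN = mean)
ranking[order(-ranking$Overall), ]

Whether averaging importances across profiles is meaningful depends on how comparable the four runs are; the point is only that the pieces needed for a consensus are already stored in the rfe objects.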