Solved – Recursive feature selection with cross-validation in the caret package (R)

Tags: caret, cross-validation, feature selection, r

The rfe function in the caret package makes it possible to perform recursive feature elimination (backward selection) with cross-validation.
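For reference, a minimal sketch of such a call looks something like this, using caret's built-in random-forest helper functions (rfFuncs); here x (a predictor data frame) and y (an outcome vector) are placeholders, not objects from the original example:

library(caret)

# Minimal sketch: backward recursive feature elimination with 10-fold CV.
# 'x' and 'y' are placeholder predictor/outcome objects.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
profile <- rfe(x, y, sizes = c(1:5, 10), rfeControl = ctrl)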

It is expected that the best features selected in each fold may differ, as is also stated on the caret webpage:

Another complication to using resampling is that multiple lists of the "best" predictors are generated at each iteration. At first this may seem like a disadvantage, but it does provide a more probabilistic assessment of predictor importance than a ranking based on a single fixed data set. At the end of the algorithm, a consensus ranking can be used to determine the best predictors to retain.

However, it is not clear to me how the final "best" set of predictors is chosen by rfe, given this expected heterogeneity among folds. I cannot find a description of the "consensus ranking" procedure mentioned above.

Thank you for your help!

Best Answer

My understanding is that the "consensus ranking" is independent of choosing the "best" set of predictors. The rfe function finds the best predictors, but as far as I know the only way to see the actual algorithm is to go through the source code. I think the author is implying that building a "consensus ranking" from the per-resample variable lists is left to the user. For example, running the code example at Feature selection: Using the caret package and showing the results of the random forest predictors:

profile.1$results

  Variables  Accuracy     Kappa  AccuracySD    KappaSD
1         1 0.9968370 0.9936464 0.007392163 0.01485547
2         2 0.9968746 0.9937256 0.009326189 0.01866587
3         3 0.9963217 0.9926185 0.009537048 0.01908711
4         4 0.9971857 0.9943537 0.006409197 0.01284846
5         5 0.9968659 0.9937105 0.007209709 0.01445173
6         6 0.9977209 0.9954207 0.006048051 0.01213925
7        20 0.9954924 0.9909603 0.009642686 0.01930148

profile.2$results

 Variables  Accuracy     Kappa AccuracySD    KappaSD
1         1 0.6483312 0.2995335 0.04698551 0.09230506
2         2 0.7723877 0.5454866 0.03916581 0.07729696
3         3 0.8274992 0.6532635 0.04604503 0.09299738
4         4 0.8388603 0.6762275 0.04361517 0.08828418
5         5 0.8309978 0.6605690 0.04846354 0.09755719
6         6 0.8242424 0.6474883 0.04556598 0.09109094
7        20 0.8005472 0.6018126 0.04871103 0.09703959

profile.3$results

 Variables  Accuracy      Kappa AccuracySD    KappaSD
1         1 0.3192818 0.05197699 0.05773080 0.07663863
2         2 0.3933106 0.13560101 0.05459624 0.07598374
3         3 0.4594806 0.22122750 0.05119101 0.06953943
4         4 0.6771564 0.53076000 0.12127578 0.17285038
5         5 0.6536151 0.49190799 0.07879014 0.11242260
6         6 0.6070402 0.42205418 0.07241226 0.10155747
7        20 0.5046387 0.25116903 0.05869522 0.07952462

profile.4$results

  Variables  Accuracy       Kappa AccuracySD    KappaSD
1         1 0.5154641 0.036353403 0.05806695 0.11057134
2         2 0.5117129 0.032926630 0.06592773 0.12742427
3         3 0.5198731 0.046944007 0.04739288 0.09231161
4         4 0.5187570 0.045917813 0.05237265 0.10100463
5         5 0.5118155 0.032686407 0.05595381 0.10829322
6         6 0.5105693 0.032829544 0.05683679 0.10436906
7        20 0.4972180 0.007899334 0.04944846 0.08724467

A consensus could be calculated across the four results using accuracy or some combination of metrics.
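As a minimal sketch of what such a consensus might look like, assuming the four profiles share the same candidate subset sizes (as in the output above) and that each rfe object stores its per-resample importances in the $variables slot with columns var and Overall (the default for rfFuncs):

profiles <- list(profile.1, profile.2, profile.3, profile.4)

# 1. Consensus on subset size: average Accuracy for each number of variables.
acc <- sapply(profiles, function(p) p$results$Accuracy)
data.frame(Variables    = profiles[[1]]$results$Variables,
           MeanAccuracy = rowMeans(acc))

# 2. Consensus ranking of individual predictors: average the per-resample
#    importances stored in each rfe object's $variables slot.
imp <- do.call(rbind, lapply(profiles, function(p) p$variables))
ranking <- aggregate(Overall ~ var, data = imp, FUN = mean)
ranking[order(-ranking$Overall), ]

Whether averaging importances across profiles is meaningful depends on how comparable the four runs are; the point is only that the pieces needed for a consensus are already stored in the rfe objects.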