Solved – Understanding cross-validated recursive feature elimination

feature selection, machine learning, optimization

I want to understand the algorithm of recursive feature elimination (RFE) combined with cross-validation (CV). An original source by Guyon et al. on RFE can be found here.

My understanding of RFE: We first train our classifier, say a linear Support Vector Machine, with all features. This gives us a weight for each feature, and the absolute value of each weight reflects the importance of that feature. We remove the least important feature, train again, get a new ranking, and continue in this way until all features have been ranked.
My question: I am running cross-validated RFE (in Python with this implementation). In the example below, several features are ranked first. How does this come about? For a final ranking I assume that the RFE elimination has to be done repeatedly, so does this imply several applications of RFE, each of which ranks a different feature first? How is this combined with cross-validation, and how is the classification accuracy for 1, 2, 3, ... features (see plot below) then calculated, when each of these subsets could consist of different features?
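The elimination loop described in the question can be sketched as follows. This is only an illustration under stated assumptions, not the implementation the question links to: the synthetic data from `make_classification`, the helper name `manual_rfe`, and the `LinearSVC` settings are all chosen here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC


def manual_rfe(X, y):
    """Rank features by repeatedly dropping the one with the smallest |weight|.

    Hypothetical helper for illustration; rank 1 is the last feature left.
    """
    remaining = list(range(X.shape[1]))  # column indices still in the model
    eliminated = []                      # filled from least to most important
    while len(remaining) > 1:
        clf = LinearSVC(C=1.0, dual=False, max_iter=10000)
        clf.fit(X[:, remaining], y)
        # Binary problem: coef_ has shape (1, n_remaining)
        importance = np.abs(clf.coef_).ravel()
        worst = int(np.argmin(importance))
        eliminated.append(remaining.pop(worst))
    eliminated.append(remaining[0])      # the last survivor gets rank 1

    ranking = np.empty(X.shape[1], dtype=int)
    for rank, feature in enumerate(reversed(eliminated), start=1):
        ranking[feature] = rank
    return ranking


# Synthetic data (an assumption; any binary classification set would do)
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
ranking = manual_rfe(X, y)
print(ranking)
```

Run this way, plain RFE gives every feature a distinct rank, so ties at rank 1 have to come from somewhere other than the elimination loop itself.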

Best Answer

Say you run a 3-fold RFECV. For each split, RFE is run on the training set so that every feature count from 1 to n is covered. For each count, the supplied classifier is trained on the reduced training set and scored on the test set. For each count 1..n, the mean score over the 3 splits is what the graph you included shows; the subsets behind one point on the curve can differ between splits, since only the number of features is being evaluated. RFECV then runs RFE once more on the entire data set, keeping the best-scoring number of features. The ranking you see comes from that final fit: every selected feature is assigned rank 1, which is why several features are ranked first.
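A small run makes the tie at rank 1 visible. This is a minimal sketch on synthetic data; the data set, the `LinearSVC` settings and the 3-fold splitter are assumptions for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic data (an assumption): 10 features, 4 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

selector = RFECV(
    estimator=LinearSVC(dual=False, max_iter=10000),
    step=1,                          # drop one feature per elimination step
    cv=StratifiedKFold(n_splits=3),  # the "3-fold" case from the answer
    scoring="accuracy",
)
selector.fit(X, y)

# Number of features with the best mean cross-validated score
print("chosen number of features:", selector.n_features_)
# Final ranking from the refit on all data; selected features all have rank 1
print("ranking:", selector.ranking_)

n_rank_one = int(np.sum(selector.ranking_ == 1))
print("features ranked first:", n_rank_one)
```

The count of rank-1 features equals `n_features_`, and the remaining features are ranked 2, 3, ... in reverse order of elimination.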

Related Solutions

Solved – Recursive feature selection with cross-validation in the caret package (R)

My understanding is that the "consensus ranking" is independent of choosing the "best" set of predictors. The rfe function finds the best predictors, but as far as I know the only place to find the actual algorithm is the source code. I think the author is implying that it is up to the user to build a "consensus ranking" from the variables. For example, running the code example at Feature selection: Using the caret package and showing the results for the random forest predictors:

profile.1$results

  Variables  Accuracy     Kappa  AccuracySD    KappaSD
1         1 0.9968370 0.9936464 0.007392163 0.01485547
2         2 0.9968746 0.9937256 0.009326189 0.01866587
3         3 0.9963217 0.9926185 0.009537048 0.01908711
4         4 0.9971857 0.9943537 0.006409197 0.01284846
5         5 0.9968659 0.9937105 0.007209709 0.01445173
6         6 0.9977209 0.9954207 0.006048051 0.01213925
7        20 0.9954924 0.9909603 0.009642686 0.01930148

profile.2$results

 Variables  Accuracy     Kappa AccuracySD    KappaSD
1         1 0.6483312 0.2995335 0.04698551 0.09230506
2         2 0.7723877 0.5454866 0.03916581 0.07729696
3         3 0.8274992 0.6532635 0.04604503 0.09299738
4         4 0.8388603 0.6762275 0.04361517 0.08828418
5         5 0.8309978 0.6605690 0.04846354 0.09755719
6         6 0.8242424 0.6474883 0.04556598 0.09109094
7        20 0.8005472 0.6018126 0.04871103 0.09703959

profile.3$results

 Variables  Accuracy      Kappa AccuracySD    KappaSD
1         1 0.3192818 0.05197699 0.05773080 0.07663863
2         2 0.3933106 0.13560101 0.05459624 0.07598374
3         3 0.4594806 0.22122750 0.05119101 0.06953943
4         4 0.6771564 0.53076000 0.12127578 0.17285038
5         5 0.6536151 0.49190799 0.07879014 0.11242260
6         6 0.6070402 0.42205418 0.07241226 0.10155747
7        20 0.5046387 0.25116903 0.05869522 0.07952462

profile.4$results

  Variables  Accuracy       Kappa AccuracySD    KappaSD
1         1 0.5154641 0.036353403 0.05806695 0.11057134
2         2 0.5117129 0.032926630 0.06592773 0.12742427
3         3 0.5198731 0.046944007 0.04739288 0.09231161
4         4 0.5187570 0.045917813 0.05237265 0.10100463
5         5 0.5118155 0.032686407 0.05595381 0.10829322
6         6 0.5105693 0.032829544 0.05683679 0.10436906
7        20 0.4972180 0.007899334 0.04944846 0.08724467

A consensus could be calculated on the four results using accuracy or some combinations of metrics.
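One possible reading of such a consensus, sketched in Python rather than R: average the accuracy of the four profiles at each subset size and pick the size with the highest mean. The unweighted mean is an assumption made here for illustration; caret does not prescribe it, and Kappa or a weighted combination could be used instead. The numbers are copied from the Accuracy columns above.

```python
# Subset sizes and per-profile accuracies, copied from the tables above
sizes = [1, 2, 3, 4, 5, 6, 20]
accuracy = {
    "profile.1": [0.9968370, 0.9968746, 0.9963217, 0.9971857,
                  0.9968659, 0.9977209, 0.9954924],
    "profile.2": [0.6483312, 0.7723877, 0.8274992, 0.8388603,
                  0.8309978, 0.8242424, 0.8005472],
    "profile.3": [0.3192818, 0.3933106, 0.4594806, 0.6771564,
                  0.6536151, 0.6070402, 0.5046387],
    "profile.4": [0.5154641, 0.5117129, 0.5198731, 0.5187570,
                  0.5118155, 0.5105693, 0.4972180],
}

# Unweighted mean accuracy across the four profiles for each subset size
mean_accuracy = {
    size: sum(acc[i] for acc in accuracy.values()) / len(accuracy)
    for i, size in enumerate(sizes)
}

# Subset size with the highest mean accuracy
best_size = max(mean_accuracy, key=mean_accuracy.get)

for size in sizes:
    print(f"{size:>2} variables: mean accuracy {mean_accuracy[size]:.4f}")
print("consensus subset size:", best_size)
```

With these figures the mean peaks at 4 variables, even though profile.1 on its own peaks at 6.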

Recursive Feature Elimination – Choice of Hyper-Parameters for SVM

I will answer my own question for posterity.

In the excellent book Applied Predictive Modeling by Max Kuhn and Kjell Johnson, the RFE algorithm is stated very clearly. It is not stated so clearly (in my opinion) in Guyon et al.'s original paper. Here it is:

Apparently, the correct procedure is to fully tune and train your model on the original data set, then (using the model) calculate the importances of the variables. Remove $k$ of them, then retrain and tune the model on the feature subset and repeat the process.
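The tune-then-eliminate procedure described above can be sketched as follows. This is an illustration under assumptions, not the book's algorithm verbatim: the helper `rfe_with_retuning`, the grid over `C`, the 3-fold inner cross-validation and the synthetic data are all chosen here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC


def rfe_with_retuning(X, y, k=2, min_features=2, param_grid=None):
    """Hypothetical helper: re-tune the model before every elimination step."""
    param_grid = param_grid or {"C": [0.01, 0.1, 1.0, 10.0]}
    remaining = list(range(X.shape[1]))
    history = []  # (feature subset, tuned parameters, inner CV score)
    while True:
        # Tune and train on the current feature subset
        search = GridSearchCV(LinearSVC(dual=False, max_iter=10000),
                              param_grid, cv=3)
        search.fit(X[:, remaining], y)
        history.append((list(remaining), search.best_params_,
                        search.best_score_))
        if len(remaining) <= min_features:
            break
        # Importance from the tuned model (binary case: one weight vector)
        importance = np.abs(search.best_estimator_.coef_).ravel()
        n_drop = min(k, len(remaining) - min_features)
        drop = set(np.argsort(importance)[:n_drop].tolist())
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
    return history


# Synthetic data (an assumption for the example)
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)
history = rfe_with_retuning(X, y, k=2)
for subset, params, score in history:
    print(len(subset), params, round(score, 3))
```

The scores recorded here come from the same data that drove the selection, so they are optimistic; an honest performance estimate would need an additional outer resampling loop around the whole procedure.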

I am a Python user, so I will speak to that: sklearn's RFE and RFECV do not implement this form of the algorithm. Instead, you pass a model (presumably already tuned), and the entire RFE procedure is performed with that model and its fixed hyperparameters. While I can offer no formal proof, I suspect that using the same model throughout RFE selection makes overfitting likely, so care should be taken when using sklearn's RFE or RFECV out of the box. That said, since RFECV does perform cross-validation, it is likely the better choice of the two.

As to whether RFE should be done with an $\ell_1$ penalty in the model: I don't know. My anecdotal evidence is that, when I tried this, my model (a linear support vector classifier) did not generalize well. However, this was for a particular data set with some problems. Take that statement with a large spoon of salt.

I will update this post if I learn more.
