Recursive Feature Elimination – Choice of Hyper-Parameters for SVM

feature-selection, machine-learning, predictive-models, scikit-learn, svm

Suppose that I am interested in applying Recursive Feature Elimination (such as RFE() in sklearn), and my data set is of medium size, with $n > p$, where $n$ is the number of observations and $p$ is the number of features. Let's say I have reason to believe that the data is relatively noisy, and that feature selection may improve classification accuracy.

If I wanted to implement RFE with a support vector classifier (say SVC() with a linear kernel), the model must first be trained and then passed to RFE(), which can then find a subset of the features to use in the final model. Do we then retrain the SVC on the data set using only that subset of the features? So my question is:

Is this the correct interpretation? If so, how do we address the tuning of hyperparameters within the model? A (linear) SVC uses the parameter $C$ to control the amount of regularization, which can drastically affect the bias/variance tradeoff. So do we optimize $C$ first, pass the tuned model to RFE() to derive the best feature subset, and then fit the same model on the selected features? Or do we choose an arbitrary $C$ (say 1), pass it to RFE(), and then train and tune a final model on the remaining features? In either case, all tuning is being done with cross-validation in this scenario.

The former option seems more correct to me, but I also think that we are missing out on some possible improvements in accuracy by not tuning the model again on the subset of the features. (Perhaps two tunings? Before RFE and after RFE?)
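
To make the options concrete, here is a minimal sketch of the two-tuning variant (tune $C$, run RFE with the tuned model, then retune on the selected subset). The data, the grid of $C$ values, and the subset size are all placeholders, not recommendations:

```python
# Minimal sketch: tune C, run RFE with the tuned model, retune on the subset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10]}  # illustrative grid

# Tuning 1: choose C on all features with cross-validation.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)
C_best = search.best_params_["C"]

# Run RFE with the tuned model to select a feature subset.
rfe = RFE(SVC(kernel="linear", C=C_best), n_features_to_select=10)
rfe.fit(X, y)
X_subset = rfe.transform(X)

# Tuning 2: retune C on the selected subset and fit the final model.
search2 = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search2.fit(X_subset, y)
final_model = search2.best_estimator_
```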

Can we generalize this procedure to other classifiers? And what if an $l_1$ penalty is available?

Best Answer

I will answer my own question for posterity.

In the excellent book Applied Predictive Modeling by Max Kuhn and Kjell Johnson, the RFE algorithm is stated very clearly. It is not stated so clearly (in my opinion) in Guyon et al.'s original paper. Here it is:

[Figure: the RFE algorithm as presented in Applied Predictive Modeling]

Apparently, the correct procedure is to fully tune and train your model on the original data set, then use that model to calculate the importances of the variables. Remove the $k$ least important features, then retrain and retune the model on the feature subset, and repeat the process.
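
I can't reproduce the book's exact pseudocode here, but a rough sketch of that loop, retuning $C$ at every elimination step, might look like the following. The data, grid, and step size $k$ are illustrative:

```python
# Rough sketch of RFE with retuning at each elimination step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10]}  # illustrative grid
k = 2                          # number of features removed per iteration
kept = np.arange(X.shape[1])   # indices of the surviving features
results = []                   # (subset size, CV score) per iteration

while len(kept) > k:
    # Retune and refit the model on the current feature subset.
    search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
    search.fit(X[:, kept], y)
    results.append((len(kept), search.best_score_))
    # Rank features by the magnitude of the linear SVM weights.
    importances = np.abs(search.best_estimator_.coef_).ravel()
    # Drop the k least important features and repeat.
    kept = kept[np.argsort(importances)[k:]]

# In practice you would pick the subset size with the best CV score.
```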

I am a Python user, so I will speak to that: sklearn's RFE and RFECV functions do not implement this form of the algorithm. Instead, you pass in a single model (presumably already tuned), and the entire RFE procedure is carried out with that one model. While I can offer no formal proof, I suspect that reusing the same tuned model at every elimination step makes it easier to overfit your data, so some care should probably be taken when using sklearn's RFE or RFECV functions out of the box. That said, since RFECV does perform cross-validation (to choose how many features to keep), it is likely the better of the two.
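
For contrast, here is a minimal sketch of the out-of-the-box usage just described, where a single fixed-$C$ estimator is reused for every elimination step; the data and the $C$ value are placeholders:

```python
# Minimal sketch: out-of-the-box RFECV with a single fixed-C estimator.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

selector = RFECV(SVC(kernel="linear", C=1.0), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)   # number of features RFECV retained
print(selector.support_)      # boolean mask of the selected features
```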

As to whether RFE should be done with an $l_1$ penalty in the model, I don't know. My anecdotal evidence is that, upon trying this, my model (a linear support vector classifier) did not generalize well. However, that was on a particular data set with problems of its own, so take that statement with a large grain of salt.
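
For anyone who wants to try it, a hypothetical sketch of what this could look like in sklearn follows. Note that the $l_1$ penalty already drives some coefficients to exactly zero, so it performs a kind of embedded selection of its own on top of RFE's rank-and-remove loop:

```python
# Hypothetical sketch: an l1-penalized linear SVC as the estimator in RFE.
# LinearSVC requires dual=False when penalty="l1"; the C value and the
# subset size are placeholders, not recommendations.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

l1_svc = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000)
rfe = RFE(l1_svc, n_features_to_select=10)
rfe.fit(X, y)
```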

I will update this post if I learn more.