Solved – RFE number of features with hyperparameter fine-tuning within cross-validation

feature-selection, hyperparameter, machine-learning, scikit-learn

I would like to use cross-validation to select both the optimal number of features to keep in the recursive feature elimination algorithm (RFE), i.e. n_features_to_select, and the optimal hyperparameter of an algorithm, say the penalty parameter C of a support vector machine (SVC). The idea is to explore every combination in a grid over both. Below is an example I implemented quickly:

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import ParameterGrid, StratifiedKFold
import numpy as np

# Create simulated data
X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)

# Grid over both the SVC penalty and the number of features kept by RFE
param_grid = {'C': [0.01, 0.1, 1],
              'n_features': [1, 2, 3, 4, 5]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

cv_res = []
for params in ParameterGrid(param_grid):

    cv_folds = []
    for train_index, val_index in cv.split(X, y):

        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # Fit RFE with a linear SVC as the base estimator,
        # then score the reduced model on the held-out fold
        svc = SVC(kernel="linear", C=params['C'], random_state=0)
        rfe = RFE(estimator=svc, n_features_to_select=params['n_features'], step=1)
        rfe.fit(X_train, y_train)

        cv_folds.append(rfe.score(X_val, y_val))

    cv_res.append(np.mean(cv_folds))
    print("combination of parameters " + str(params) + " done")

As you can see, for each combination of n_features_to_select and C I run a 3-fold cross-validation and record the accuracy on each fold. I then average across folds for each combination of hyperparameters, and the optimal combination is the one with the highest mean accuracy across the folds. I have been thinking about a better and quicker way to implement this with GridSearchCV, but the only possibility I came up with was using RFECV inside a grid search over C. That would create two nested cross-validations, an outer one exploring C and an inner one exploring n_features_to_select, which I don't want.

Any idea on how to tackle this more efficiently using scikit-learn functionality?

Thanks in advance.

Best Answer

You can easily search both parameters in a single GridSearchCV:

param_grid = {'n_features_to_select': [1, 2, 3], 'estimator__C': [0.1, 0.001]}
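
For completeness, a minimal sketch of how that fits together, reusing X and y from the question (cv=3, the scoring choice and the grid values are my assumptions for illustration):

from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

# Single, non-nested cross-validation over both parameters: RFE exposes
# n_features_to_select directly and forwards estimator__C to the wrapped SVC
rfe = RFE(estimator=SVC(kernel="linear", random_state=0), step=1)
param_grid = {'n_features_to_select': [1, 2, 3, 4, 5],
              'estimator__C': [0.01, 0.1, 1]}
search = GridSearchCV(rfe, param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, search.best_score_)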

This will be "inefficient" in that it rebuilds RFE from scratch for 1, 2 and 3 features. The most efficient way would be to run RFECV several times for different values of C and let RFECV do the cross-validation over the number of features. That's not very elegant, though, and being able to do this efficiently with GridSearchCV would be ideal. I have apparently been wanting to work on this since 2013: https://github.com/scikit-learn/scikit-learn/issues/1626
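
A rough sketch of that RFECV-per-C alternative, under the same assumptions as above (note the cv_results_ attribute used here is available in recent scikit-learn releases; older versions exposed grid_scores_ instead):

from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

# One RFECV per candidate C: RFECV cross-validates over the number of
# features internally, so only C is looped over explicitly
best = {}
for C in [0.01, 0.1, 1]:
    rfecv = RFECV(estimator=SVC(kernel="linear", C=C, random_state=0),
                  step=1, cv=3, scoring='accuracy')
    rfecv.fit(X, y)
    # n_features_ is the selected feature count; take the best mean CV
    # score across feature counts as the score for this C
    best[C] = (rfecv.n_features_, max(rfecv.cv_results_['mean_test_score']))
print(best)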
