I would like to use cross-validation to select both the optimal number of features to keep (n_features_to_select) in recursive feature elimination (RFE) and an optimal hyperparameter of the underlying estimator, say the penalty parameter C of a support vector classifier (SVC). The idea is to evaluate every combination on a grid over both. Below is a quick example that I have implemented:
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import ParameterGrid, StratifiedKFold
import numpy as np
# Create simulated data
X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)

param_grid = {'C': [0.01, 0.1, 1],
              'n_features': [1, 2, 3, 4, 5]}
# random_state is only allowed when shuffle=True; without shuffle it raises a ValueError
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_res = []
for params in ParameterGrid(param_grid):
    cv_folds = []
    for train_index, val_index in cv.split(X, y):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]
        # Fit RFE with a fresh SVC for this C, then score on the held-out fold
        svc = SVC(kernel="linear", C=params['C'], random_state=0)
        rfe = RFE(estimator=svc, n_features_to_select=params['n_features'], step=1)
        rfe.fit(X_train, y_train)
        cv_folds.append(rfe.score(X_val, y_val))
    cv_res.append(np.mean(cv_folds))
    print("combination of parameters: " + str(params) + " ended")
As you can see, for each combination of n_features_to_select and C, I run a 3-fold cross-validation and record the accuracy on each fold. I then average across folds for each combination of hyperparameters, and the optimal combination would be the one with the highest average accuracy across the folds. I have been thinking about a better and quicker way of implementing this with GridSearchCV, but the only possibility I came up with was using RFECV inside a grid search over C. That would create two nested cross-validations, an outer one exploring C and an inner one exploring n_features_to_select, and I don't want this.
Any idea on how to tackle this in a more efficient way using scikit functionalities?
Thanks in advance.
Best Answer
You can easily search both parameters in a single GridSearchCV:
This will be "inefficient" in that it rebuilds the RFE from scratch for each feature count (1, 2, 3, ...). The most efficient way would be to run RFECV several times, once per value of C, and let RFECV do the cross-validation over feature counts. That's not very elegant, though, and being able to do this efficiently with GridSearchCV would be ideal. I have been meaning to work on this apparently since 2013: https://github.com/scikit-learn/scikit-learn/issues/1626