I would like to use cross-validation to select both the optimal number of features to keep (n_features_to_select) in recursive feature elimination (RFE) and an optimal hyperparameter of the underlying estimator, say the penalty parameter C of a support vector classifier (SVC). The idea is to evaluate every combination on a grid over both. Below is a quick example that I have implemented:
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import ParameterGrid, StratifiedKFold
import numpy as np
# Create simulated data
X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)

param_grid = {'C': [0.01, 0.1, 1],
              'n_features': [1, 2, 3, 4, 5]}
# random_state is only allowed when shuffle=True; without shuffle it raises a ValueError
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_res = []
for params in ParameterGrid(param_grid):
    cv_folds = []
    for train_index, val_index in cv.split(X, y):
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]
        # Fit RFE with a fresh SVC for this C, then score on the held-out fold
        svc = SVC(kernel="linear", C=params['C'], random_state=0)
        rfe = RFE(estimator=svc, n_features_to_select=params['n_features'], step=1)
        rfe.fit(X_train, y_train)
        cv_folds.append(rfe.score(X_val, y_val))
    cv_res.append(np.mean(cv_folds))
    print("combination of parameters: " + str(params) + " ended")
As you can see, for each combination of n_features_to_select and C, I run a 3-fold cross-validation and record the accuracy on each fold. I then average across folds for each combination of hyperparameters, and the optimal combination would be the one with the highest average accuracy across the folds. I have been thinking about a better and quicker way of implementing this with GridSearchCV, but the only possibility I came up with was using RFECV inside a grid search over C. That would create two nested cross-validations, an outer one exploring C and an inner one exploring n_features_to_select, and I don't want this.
Any idea on how to tackle this in a more efficient way using scikit functionalities?
Thanks in advance.
Best Answer
You can easily search both parameters in a single GridSearchCV:
This will be "inefficient" in that it rebuilds the RFE from scratch for each feature count (1, 2, 3, ...). The most efficient way would be to run RFECV several times, once per value of C, and let RFECV do the cross-validation over feature counts. That's not very elegant, though, and being able to do this efficiently with GridSearchCV would be ideal. I have been meaning to work on this apparently since 2013: https://github.com/scikit-learn/scikit-learn/issues/1626