Solved – Cross-validation of a machine learning pipeline

feature selection, model, model selection, python, scikit-learn

I want to find the best modelling process for a machine learning pipeline, i.e. normalize $\rightarrow$ feature select $\rightarrow$ test model performance. For example, say I want to try Ridge, Lasso, and Elastic Net regression, and for each model I do normalization, feature selection, and a cross-validated hyperparameter search. I then want to pick the best of the three.

Does it theoretically make sense to run cross-validation where the entire pipeline is refit within each fold and evaluated on the held-out fold? In scikit-learn, something like this:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, cross_val_score

models = [Ridge(), Lasso(), ElasticNet()]

scores = []
for model in models:
    # inner CV search over this model's regularisation strength
    param_search = GridSearchCV(model, {'alpha': [0.01, 0.1, 1.0, 10.0]})
    pipe = Pipeline([('scaling', StandardScaler()),
                     ('feature_selection', RFECV(model)),
                     ('param_search', param_search)])
    # outer CV: the whole pipeline is refit within each training fold
    scores.append(cross_val_score(pipe, X, y))  # X, y: my features and target

# get the best model pipeline from cross_val_score, fit on all my data,
#  and whoop there is my best model

Best Answer

From the look of the code, the whole model-fitting procedure is being cross-validated for each model, which is indeed the right way to go about it. Cross-validation is best viewed as a method for evaluating the performance of a procedure for fitting a model, rather than of the fitted model itself. So if you perform feature selection, it needs to be performed independently in each fold of the cross-validation; if you tune hyper-parameters, they need to be tuned independently in each fold as well. It is important to remember, however, that the cross-validation score of the best model will be an optimistically biased estimate of the performance of the final system, as that score has been directly optimised in choosing the best model.
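
One way to avoid that bias when reporting performance is to wrap the model selection itself in a further, outer cross-validation (nested CV). Below is a minimal sketch of that idea in scikit-learn; the synthetic data, the SelectKBest selector, and the alpha grids are illustrative assumptions rather than anything from the question:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative data standing in for the poster's X, y
X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

# One pipeline; the inner grid search swaps in each candidate model and its alpha
pipe = Pipeline([('scaling', StandardScaler()),
                 ('feature_selection', SelectKBest(f_regression, k=10)),
                 ('model', Ridge())])
param_grid = [
    {'model': [Ridge()],      'model__alpha': [0.1, 1.0, 10.0]},
    {'model': [Lasso()],      'model__alpha': [0.01, 0.1, 1.0]},
    {'model': [ElasticNet()], 'model__alpha': [0.01, 0.1, 1.0]},
]
search = GridSearchCV(pipe, param_grid, cv=5)  # inner CV: performs the model selection

# Outer CV: scores the whole selection procedure (scaling, selection, model choice),
# so this estimate is not optimistically biased by picking the winner
nested_scores = cross_val_score(search, X, y, cv=5)

# For the model to actually deploy, refit the search on all the data
search.fit(X, y)
best_pipeline = search.best_estimator_

Here the outer cross_val_score evaluates the selection procedure as a whole, while the final refit of the search on all the data gives the pipeline you would actually use.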
