Solved – Cross-validation of a machine learning pipeline

I want to find the best modelling procedure for a machine learning pipeline, i.e. normalize $\rightarrow$ select features $\rightarrow$ evaluate model performance. For example, say I want to try Ridge, Lasso, and Elastic Net regression, each with normalization, feature selection, and a cross-validated hyperparameter search, and then pick the best of the three.

Does it theoretically make sense to run cross-validation where the entire pipeline is refit on the training folds and then evaluated on each left-out fold? In scikit-learn, something like this:

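from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# scaler, selectorCV, param_grid, X and y are assumed to be defined elsewhere
models = [Ridge(), Lasso(), ElasticNet()]
scores = []

for model in models:
    pipe = Pipeline([('scaling', scaler),
                     ('feature_selection', selectorCV),
                     ('param_search', GridSearchCV(model, param_grid))])
    scores.append(cross_val_score(pipe, X, y))

# get the best model pipeline from cross_val_score, fit on all my data,
#  and whoop there is my best model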

Best Answer

Judging from the code, the whole model-fitting procedure is being cross-validated for each model, which is indeed the right way to go about it. Cross-validation is best viewed as a method for evaluating the performance of a procedure for fitting a model, not of the fitted model itself. So if you perform feature selection, it needs to be performed independently in each fold of the cross-validation, and if you tune hyper-parameters, they need to be tuned independently in each fold as well. It is important to remember, however, that the cross-validation score of the best model will be an optimistically biased estimate of the performance of the final system, because that score was itself used to choose the best model.
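One way to make this concrete is to treat the choice between Ridge, Lasso, and Elastic Net as just another hyper-parameter, so that an outer cross-validation loop evaluates the whole procedure, model choice included. The following is only a sketch under assumptions that are not in the question: the data are synthetic (`make_regression`), `SelectKBest` with `f_regression` stands in for whatever feature selector is actually used, and the grids for `alpha`, `k`, and `l1_ratio` are arbitrary illustrative values.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (assumption, not from the question).
X, y = make_regression(n_samples=120, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# The whole procedure: scale, select features, fit a linear model.
pipe = Pipeline([
    ("scaling", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),  # placeholder step, swapped by the grid below
])

# Illustrative grids (assumed values).
alphas = [0.1, 1.0, 10.0]
ks = [5, 10]

# One sub-grid per candidate model; the search replaces the "model" step.
param_grid = [
    {"model": [Ridge()],
     "model__alpha": alphas, "feature_selection__k": ks},
    {"model": [Lasso(max_iter=10000)],
     "model__alpha": alphas, "feature_selection__k": ks},
    {"model": [ElasticNet(max_iter=10000)],
     "model__alpha": alphas, "model__l1_ratio": [0.2, 0.8],
     "feature_selection__k": ks},
]

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Inner search: chooses model family, alpha and k using only the data it is given.
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="r2")

# Outer loop: refits the entire search on each training split and scores it on
# the held-out split, so the estimate also accounts for the selection step.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print("Outer-loop R^2 per fold:", outer_scores)
print("Mean outer-loop R^2:", outer_scores.mean())

# Final model: run the same procedure once on all the data.
search.fit(X, y)
best_model_name = type(search.best_estimator_.named_steps["model"]).__name__
print("Selected model:", best_model_name)
print("Selected parameters:", search.best_params_)
```

The mean of `outer_scores` estimates the performance of the procedure as a whole, whereas `search.best_score_` after the final fit is the score that was used to make the choice and so is subject to the optimistic bias described above.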

