Solved – Cross-validation of a machine learning pipeline

feature selection, model, model selection, python, scikit-learn

I want to find the best modelling process for a machine learning pipeline, i.e. normalize $\rightarrow$ feature select $\rightarrow$ test model performance. For example, say I want to try Ridge, Lasso, and Elastic Net regression, and for each model I do normalization, feature selection, and a cross-validated hyperparameter search. I then want to pick the best of the three.

Does it theoretically make sense to run cross-validation where the entire pipeline is refit within each fold and evaluated on the held-out fold? In scikit-learn, something like this:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, cross_val_score

models = [Ridge(), Lasso(), ElasticNet()]

scores = []
for model in models:
    # inner CV search over this model's regularisation strength
    param_search = GridSearchCV(model, {'alpha': [0.01, 0.1, 1.0, 10.0]})
    pipe = Pipeline([('scaling', StandardScaler()),
                     ('feature_selection', RFECV(model)),
                     ('param_search', param_search)])
    # outer CV: the whole pipeline is refit within each training fold
    scores.append(cross_val_score(pipe, X, y))  # X, y: my features and target

# get the best model pipeline from cross_val_score, fit on all my data,
#  and whoop there is my best model

Best Answer

From the look of the code, the whole model-fitting procedure is being cross-validated for each model, which is indeed the right way to go about it. Cross-validation is best viewed as a method for evaluating the performance of a procedure for fitting a model, rather than of the fitted model itself. So if you perform feature selection, it needs to be performed independently in each fold of the cross-validation; if you tune hyper-parameters, they need to be tuned independently in each fold as well. It is important to remember, however, that the cross-validation score of the best model will be an optimistically biased estimate of the performance of the final system, as that score has been directly optimised in choosing the best model.
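
One way to avoid that bias when reporting performance is to wrap the model selection itself in a further, outer cross-validation (nested CV). Below is a minimal sketch of that idea in scikit-learn; the synthetic data, the SelectKBest selector, and the alpha grids are illustrative assumptions rather than anything from the question:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative data standing in for the poster's X, y
X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

# One pipeline; the inner grid search swaps in each candidate model and its alpha
pipe = Pipeline([('scaling', StandardScaler()),
                 ('feature_selection', SelectKBest(f_regression, k=10)),
                 ('model', Ridge())])
param_grid = [
    {'model': [Ridge()],      'model__alpha': [0.1, 1.0, 10.0]},
    {'model': [Lasso()],      'model__alpha': [0.01, 0.1, 1.0]},
    {'model': [ElasticNet()], 'model__alpha': [0.01, 0.1, 1.0]},
]
search = GridSearchCV(pipe, param_grid, cv=5)  # inner CV: performs the model selection

# Outer CV: scores the whole selection procedure (scaling, selection, model choice),
# so this estimate is not optimistically biased by picking the winner
nested_scores = cross_val_score(search, X, y, cv=5)

# For the model to actually deploy, refit the search on all the data
search.fit(X, y)
best_pipeline = search.best_estimator_

Here the outer cross_val_score evaluates the selection procedure as a whole, while the final refit of the search on all the data gives the pipeline you would actually use.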
