I want to find the best model process for a machine learning pipeline. In other words, normalize $\rightarrow$ feature select $\rightarrow$ test model performance. For example, let's say I want to try Ridge, Lasso, and Elastic Net regression and I am doing normalization, feature selection, and a cross-validated hyperparameter search for all models. I want to pick the best out of the three.
Does it theoretically make sense to run cross-validation where the entire pipeline is run on each training fold? In SKLearn, something like this:
```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# scaler, selectorCV, and gridsearchCV assumed defined elsewhere
models = [Ridge(), Lasso(), ElasticNet()]
scores = []
for model in models:
    pipe = Pipeline([('scaling', scaler),
                     ('feature_selection', selectorCV),
                     ('param_search', gridsearchCV)])
    scores.append(cross_val_score(pipe, X, y))
# get the best model pipeline from cross_val_score, fit on all my data,
# and whoop there is my best model
```
Best Answer
From the look of the code, the whole model-fitting procedure is being cross-validated for each model, which is indeed the right way to go about it (this set-up is often called nested cross-validation). Cross-validation is best viewed as a method for evaluating the performance of a procedure for fitting a model, not of the fitted model itself. So if you perform feature selection, it needs to be performed independently in each fold of the cross-validation; likewise, if you tune hyper-parameters, they need to be tuned independently in each fold.

It is important to remember, however, that the cross-validation score of the best model will be an optimistically biased estimate of the performance of the final system, because that score has been directly optimised in choosing the best model.
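To make the idea concrete, here is a minimal runnable sketch of that nested set-up. The scaler, selector, estimators, hyper-parameter grids, and synthetic data are all illustrative assumptions, not a prescription; the point is that scaling, feature selection, and the grid search all sit inside the `Pipeline`, so they are refit on each outer training fold.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data purely for illustration
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

results = {}
for name, model in [('ridge', Ridge()), ('lasso', Lasso()), ('enet', ElasticNet())]:
    # Inner CV: tunes alpha on each outer training fold only
    inner_search = GridSearchCV(model, {'alpha': [0.1, 1.0, 10.0]}, cv=3)
    pipe = Pipeline([('scaling', StandardScaler()),
                     ('feature_selection', SelectKBest(f_regression, k=10)),
                     ('param_search', inner_search)])
    # Outer CV: scores the whole fitting procedure, not one fitted model
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()

best = max(results, key=results.get)
```

Note that `results[best]` is still the optimistically biased score the answer warns about; an honest estimate of the final system's performance would require wrapping this entire selection loop in yet another cross-validation layer.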