PCA and Cross-Validation – Combining PCA, Feature Scaling, and Cross-Validation Without Data Leakage

Tags: cross-validation, data-leakage, machine-learning, pca, scikit-learn

The scikit-learn documentation for cross-validation says the following about combining feature scaling with cross-validation:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction

I understand the reason behind this is to prevent information leaking from the test set into the training set during cross-validation, which would result in an optimistically biased estimate of model performance.

I am wondering, then: if I wish to use Principal Component Analysis (PCA) to reduce the size of a feature set before training, say, a regression model, and PCA requires feature scaling to be effective, how do I chain feature scaling to PCA to cross-validated regression without introducing data leakage between the train/test splits of the cross-validation?

Best Answer

You need to think of feature scaling, then PCA, then your regression model as an unbreakable chain of operations (as if it were a single model), and apply the cross-validation to that chain as a whole. This is quite tricky to code yourself but considerably easy in sklearn via Pipelines. A Pipeline object is a cascade of operators on the data that is regarded (and acts) as a single model conforming to the library's fit/predict paradigm: calling fit on the pipeline fits each step on the training data only, and predict pushes held-out data through the already-fitted steps.
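
For illustration, here is a minimal sketch of that idea. The synthetic dataset, the Ridge regressor, and the choice of ten components are placeholders standing in for your actual data and model, not part of the original question:

    from sklearn.datasets import make_regression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Synthetic data as a stand-in for your real feature set
    X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),    # fitted on each training fold only
        ("pca", PCA(n_components=10)),  # learnt from the scaled training fold
        ("reg", Ridge()),               # regression on the reduced features
    ])

    # cross_val_score refits the entire pipeline on every training fold, so no
    # statistics from the held-out fold ever reach the scaler or the PCA.
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(scores.mean(), scores.std())

A nice side effect of this design is that hyperparameters of any step can also be tuned without leakage, e.g. by passing the same pipeline to GridSearchCV with a grid such as {"pca__n_components": [5, 10, 20]}.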
