Cross-Validation – Why Pipelining Preprocessing Steps Should Be Mandatory in Cross-Validation

cross-validation, scikit-learn

I'm currently experimenting with sklearn's transformer classes in combination with pandas. While working on a Kaggle data set and being careful not to use any information from the test data, I realized the following:

Typically, in those challenges you have a training data set and a test data set. We perform preprocessing: say we impute the missing values in both the train and test data sets with the mean computed on the training data. Then we do our usual work of finding a valid model.
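In scikit-learn terms, that usual workflow could look roughly like this (the column name and values are made up for illustration):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data; "age" and its values are purely illustrative.
train = pd.DataFrame({"age": [25.0, None, 40.0, 31.0]})
test = pd.DataFrame({"age": [None, 52.0]})

imputer = SimpleImputer(strategy="mean")
train_imputed = imputer.fit_transform(train)  # learns the training mean
test_imputed = imputer.transform(test)        # reuses it on the test set
```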

But wait, let's forget the test data set for a moment (on Kaggle you don't even have values for the response variable on the test set). We want to estimate the generalization performance of our model, and this can be achieved by cross-validation on the training data. But if we imputed before splitting, the training folds in each CV split contain values that were computed using the corresponding validation fold, i.e. information from the held-out data has leaked in!

For example, imputing by the mean will yield different values depending on whether the mean is computed on the whole training set or only on each CV training fold, as the snippet below shows.
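A quick sketch of that fold-to-fold difference, on made-up numbers:

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up values; the point is only that the fold means differ.
x = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
print("mean on the whole training set:", x.mean())

for i, (train_idx, _) in enumerate(KFold(n_splits=3).split(x)):
    print(f"mean on training fold {i}:", x[train_idx].mean())
```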

So, shouldn't data preprocessing steps always be included in the pipeline?
Can this be a severe problem?

Best Answer

So, shouldn't data preprocessing steps always be included in the pipeline?

Ideally, it should; it's part of the model, after all. An exception can be made when you have lots of data, so that the estimates across folds end up being almost the same. But in general it's better to assess the performance of the whole model-building process, and that includes any preprocessing step that learns from the training data.
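A minimal sketch of what that looks like in scikit-learn, on a synthetic data set: because the imputer sits inside the Pipeline, cross_val_score refits it on each training fold, and the held-out fold never influences the imputed values.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with ~10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=200)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)  # imputer refit per fold
print(scores.mean())
```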

Can this be a severe problem?

With big data, probably not. With small data (and this includes categorical variables with many levels), this can add a large optimistic bias to your performance estimates. That's mostly because the preprocessing estimates are of lower quality when you only use a fold's training data, and they might jump a lot across folds, degrading your model.
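To illustrate how bad it can get with a many-level categorical on small data, here is a hedged sketch using a tiny, purely illustrative target-mean encoder (the class below is hypothetical, written just for this demo) on a pure-noise target. The leaky version encodes once on all the training data before cross-validating; the honest version refits the encoder inside each fold via the Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Illustrative encoder: replaces each category with the mean of y
    seen during fit; unseen categories get the global mean."""
    def fit(self, X, y):
        cats = X.ravel()
        self.global_mean_ = y.mean()
        self.means_ = {c: y[cats == c].mean() for c in np.unique(cats)}
        return self

    def transform(self, X):
        cats = X.ravel()
        return np.array([self.means_.get(c, self.global_mean_)
                         for c in cats]).reshape(-1, 1)

# Small data, one categorical with 50 levels, and a noise target:
# any accuracy above chance is pure leakage.
rng = np.random.default_rng(0)
n = 100
X = rng.integers(0, 50, size=n).astype(str).reshape(-1, 1)
y = rng.integers(0, 2, size=n)

# Leaky: encode once on all the data, then cross-validate.
X_leaky = MeanTargetEncoder().fit(X, y).transform(X)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Honest: the encoder is refit inside each training fold.
pipe = Pipeline([("enc", MeanTargetEncoder()),
                 ("model", LogisticRegression())])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # typically well above chance
print(f"honest CV accuracy: {honest:.2f}")  # typically close to 0.5
```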
