Solved – Pre-Processing – Applied on all three (training/validation/test) sets

data preprocessing

From what I understand from previously answered questions, you're meant to do your pre-processing on each set after splitting your data into training and test sets. But I'm not sure where the validation set comes into this. Do I also pre-process it separately to the training set? Or do I pre-process the training set as a whole and then separate the validation set?

I'm 99% sure you're meant to do all three of them separately, but the way my assignment is worded put me in some doubt so I thought I'd seek an answer/opinion here.

Best Answer

You should apply the same preprocessing to all your data. However, if that preprocessing depends on the data (e.g. standardization, PCA), then you should calculate its parameters on your training data only and reuse those parameters when you apply it to your validation and test data.

For example, if you are centering your data (subtracting the mean), then you should calculate the mean on your training data ONLY and then subtract that same mean from all your data (i.e. subtract the mean of the training data from the validation and test data; DO NOT calculate 3 separate means).
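To make the centering example concrete, here is a minimal sketch in plain Python. The toy values and the names `train`, `validation`, `test` and `center` are made up for illustration; the only point is that the mean is computed once, from the training set, and reused everywhere.

```python
from statistics import fmean

# Hypothetical toy data: a single feature, already split into three sets.
train = [2.0, 4.0, 6.0, 8.0]
validation = [5.0, 7.0]
test = [1.0, 9.0]

# Compute the centering parameter on the training data ONLY.
train_mean = fmean(train)


def center(values, mean):
    """Subtract a precomputed mean from every value."""
    return [v - mean for v in values]


# Apply the SAME training mean to all three sets.
train_centered = center(train, train_mean)
validation_centered = center(validation, train_mean)
test_centered = center(test, train_mean)
```

Note that the validation and test sets will generally not end up with a mean of exactly zero after this, and that is expected: they are being treated like unseen data.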

For cross-validation, you'll have to recalculate the preprocessing parameters in each iteration, using only the folds that are being used for training in that iteration, and then apply them to the held-out validation fold. If you then train a final model on all your data, you need to recalculate the preprocessing parameters using all the CV data.
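The cross-validation case could be sketched roughly like this, again in plain Python with centering as the preprocessing step. The helper `k_fold_indices`, the toy data and the choice of 3 contiguous folds are all assumptions for illustration, and the model fitting itself is left out.

```python
from statistics import fmean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


# Hypothetical toy data: everything available for cross-validation.
cv_data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

fold_means = []
for train_idx, val_idx in k_fold_indices(len(cv_data), k=3):
    # Recompute the mean from this iteration's training folds only.
    fold_mean = fmean([cv_data[i] for i in train_idx])
    fold_means.append(fold_mean)

    # Apply that mean to both the training folds and the held-out fold.
    train_fold = [cv_data[i] - fold_mean for i in train_idx]
    val_fold = [cv_data[i] - fold_mean for i in val_idx]
    # ... fit on train_fold and evaluate on val_fold here ...

# For the final model trained on all the CV data, recompute the mean on all of it.
final_mean = fmean(cv_data)
final_centered = [x - final_mean for x in cv_data]
```

Each iteration gets its own mean because the held-out fold has to stay unseen while the preprocessing parameters are being estimated.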