Solved – Why standardization of the testing set has to be performed with the mean and sd of the training set

centering, data preprocessing, machine learning

When pre-processing a data set before applying a machine learning algorithm, each variable can be centered by subtracting its mean and scaled by dividing by its standard deviation.
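
In symbols, each raw value $x$ of a variable is replaced by its $z$-score,

$$z = \frac{x - \bar{x}}{s},$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation used for the transformation. The whole question is which $\bar{x}$ and $s$ to plug in for the testing set.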

This is a straightforward process for the training set, but when it comes to the testing set the procedure seems more ad hoc. I have read that the mean subtracted from each value in the testing set should be the mean of the training set, not of the testing set; and the same goes for the standard deviation.
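
For what it's worth, this is the behaviour of scikit-learn's StandardScaler when it is fit on the training set only and then used to transform both sets. A minimal sketch (the arrays here are made-up placeholder data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 1))  # placeholder training data
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 1))    # placeholder testing data

scaler = StandardScaler()
scaler.fit(X_train)                      # learns the training mean and sd only
X_train_std = scaler.transform(X_train)  # standardized with training statistics
X_test_std = scaler.transform(X_test)    # testing set reuses the SAME statistics
```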

Is there really a mathematical need behind this asymmetry, or is it an exercise in sticking to the principle of not touching the testing set until the end – more of a "philosophical" heuristic?

Best Answer

When you center and scale a variable in the training data, using the mean and sd of that variable calculated on the training data, you are essentially creating a brand-new variable. Then you fit, say, a regression on that brand-new variable.
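
Concretely, if the fitted model is a simple regression on the standardized variable, the prediction rule is tied to the training statistics:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 z, \qquad z = \frac{x - \bar{x}_{\text{train}}}{s_{\text{train}}}.$$

Predicting for a new $x$ means computing $z$ with the same $\bar{x}_{\text{train}}$ and $s_{\text{train}}$, because those are the numbers the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ were estimated against.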

To use that new variable to predict for the validation and/or test sets, you have to create the same variable in those sets. Subtracting a different number and dividing by a different number does not create the same variable.
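
A small self-contained sketch of the mismatch (the data here are made up for illustration): the same raw testing values get different $z$-scores, and would therefore feed different inputs into the fitted model, if the testing set's own statistics were used.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(5.0, 2.0, size=200)  # placeholder training variable
x_test = rng.normal(5.0, 2.0, size=20)    # placeholder testing variable

mu, sd = x_train.mean(), x_train.std()    # training statistics

z_same = (x_test - mu) / sd                             # the SAME variable as in training
z_different = (x_test - x_test.mean()) / x_test.std()   # a different variable

# The two transformations disagree (almost surely, for random samples):
print(np.allclose(z_same, z_different))   # False
```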