Solved – Why do we normalize test data on the parameters of the training data

feature-scaling, machine-learning, normalization

I just built a toy linear regression model with gradient descent, coding it from scratch. It was doing fine on the training data, but its predictions were off on the test data. In the end I figured out that I was normalizing new data according to its own mean and range, instead of using the mean and range of the training data.

And I realized that I never understood why this doesn't work, and I never found an explainer. Intuitively, I see normalization as a way of "rewriting" the data without changing its structure. When I get the test data, I can easily calculate its mean, range, and standard deviation, so shouldn't I "rewrite" it in terms of itself? Doing it with the statistics of the training data also feels a bit like cheating, since we typically avoid any contact between the training set and the test set.
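For concreteness, here is a simplified sketch of what I was doing (the data here is synthetic and the variable names are made up; my actual model is a plain linear regression trained with gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 2))  # stand-ins for my real features
X_test = rng.normal(size=(20, 2))

# What I was doing: scaling each split by its OWN mean and range.
X_train_scaled = (X_train - X_train.mean(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
X_test_scaled = (X_test - X_test.mean(axis=0)) / (X_test.max(axis=0) - X_test.min(axis=0))

# The two splits end up on slightly different scales, so the weights
# learned on X_train_scaled don't quite apply to X_test_scaled.
```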

Best Answer

You are supposed to use the parameters from the training data set to standardize the test data.

This is because in real life we only ever have access to some subset of the total population of data. When we deploy a data-driven (statistical) model, it needs to handle new, unseen data. Since we do not have access to that new data yet, we cannot realistically gather some large "group" of it to compute a separate standardization. The test data is meant to simulate this part of reality: there is data out there that we don't have access to, and eventually our model needs to process it based on "what it knows", i.e. the statistics of the data it was trained on.
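As a minimal sketch (assuming plain mean/standard-deviation standardization on NumPy arrays; sklearn's `StandardScaler` encodes the same pattern via `fit` on the training set and `transform` on everything else), the parameters are computed once from the training data and then reused:

```python
import numpy as np

def fit_scaler(X):
    """Compute standardization parameters from the training data only."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return mu, sigma

def apply_scaler(X, mu, sigma):
    """Standardize any data (train, test, or future production data)
    with the parameters learned from the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(25, 3))

mu, sigma = fit_scaler(X_train)                 # "what the model knows"
X_train_std = apply_scaler(X_train, mu, sigma)
X_test_std = apply_scaler(X_test, mu, sigma)    # same parameters, no peeking
```

At deployment time you would persist `mu` and `sigma` alongside the model weights, since every new observation has to be scaled with exactly these values before it is fed to the model.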

Sure, perhaps we could always "wait" for more data to arrive, then retrain the model and update the statistics, but that is often expensive. Therefore we usually train once (or once in a long while) on whatever data is available at the start, and we estimate the generalization error and the model's adaptability by making sure we never touch the test data during model building and data pre-processing.

Also, as brought up in a comment above, it is important that the training and test data come from (are reflective of) the same distribution; this is an assumption we make in our modelling.
