Cross-Validation – The Importance of Normalization Prior to Cross-Validation

Tags: cross-validation, normalization

Does normalizing data (to zero mean and unit standard deviation) prior to performing a repeated k-fold cross-validation have any negative consequences, such as overfitting?

Note: this is for a situation where #cases > total #features

I am transforming some of my data using a log transform, then normalizing all data as above. I then perform feature selection. Next I apply the selected features and normalized data to a repeated 10-fold cross-validation to estimate generalized classifier performance, and I am concerned that using all the data for normalization may not be appropriate. Should I instead normalize the test data for each fold using the normalization parameters (mean and standard deviation) obtained from that fold's training data?
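For concreteness, here is a minimal sketch of the per-fold approach being asked about, using scikit-learn and synthetic data standing in for the real set (the data, classifier, and fold counts are illustrative assumptions): the scaler is fit on each training fold only and then applied to that fold's test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: a binary classification task).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    # Fit the scaler on the training fold only ...
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    # ... and apply the same (training-derived) transform to the test fold.
    scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))

print(round(float(np.mean(scores)), 3))
```

The key point is that the test fold never contributes to the mean and standard deviation used to scale it.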

Any opinions gratefully received! Apologies if this question seems obvious.

Edit:
On testing this (in line with the suggestions below), I found that normalization prior to CV made little difference in performance compared to normalization within CV.
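That comparison can be reproduced with a small experiment; this is a sketch under assumed synthetic data and classifier, not the poster's actual setup. It scores the same model twice: once with the scaler fit on the whole data set before CV, and once with the scaler refit inside each fold via a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Normalization before CV: the scaler sees the full data set once.
before = cross_val_score(LogisticRegression(),
                         StandardScaler().fit_transform(X), y, cv=cv)

# Normalization within CV: the scaler is refit on each training fold.
within = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()),
                         X, y, cv=cv)

print(round(before.mean(), 3), round(within.mean(), 3))
```

With a reasonable number of cases per fold, the fold means and standard deviations are close to the global ones, which is why the two scores typically differ little.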

Best Answer

To answer your main question: it would be optimal and more appropriate to scale within the CV. But it will probably not matter much, and may not be important in practice at all, if your classifier rescales the data internally, as most do (at least in R).

However, selecting features before cross-validating is a BIG NO and will lead to overfitting, since you will select them based on how they perform on the whole data set. The log-transformation is fine to perform outside the CV, since the transformation does not depend on the actual data (only on the type of data): you would apply it identically whether you had 90% of the data or 100%, and it is not tweaked according to the data.
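A common way to keep feature selection inside the CV is to wrap it in a pipeline, so the selector (and scaler) are refit on each training fold and the test fold never influences which features are chosen. This is a minimal scikit-learn sketch; the synthetic data, univariate F-test selector, and choice of k are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: many features, few of them informative.
X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

# Scaling and selection live inside the pipeline, so both are
# refit on each training fold during cross-validation.
pipe = make_pipeline(StandardScaler(),
                     SelectKBest(f_classif, k=5),
                     LogisticRegression())

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print(round(scores.mean(), 3))
```

Selecting the k best features on the full data first and then cross-validating only the classifier would leak information from every test fold into the selection step, which is exactly the mistake described above.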

To also answer your comment: whether it results in overfitting obviously depends on your manner of feature selection. If you choose features at random (why would you do that?) or from a priori theoretical considerations (other literature), it won't matter. But if the selection depends on your data set, it will. The Elements of Statistical Learning has a good explanation; you can download the PDF freely and legally here: http://www-stat.stanford.edu/~tibs/ElemStatLearn/

The point of concern to you is in section 7.10.2, on page 245 of the fifth printing, titled "The Wrong and Right Way to Do Cross-validation".