Solved – Pre-processing (center, scale, impute) across different versions of the training set and the test set – what is a good approach?

data-preprocessing, dataset, machine-learning

I am currently working on a multi-class classification problem with a large training set. However, it has some specific characteristics that led me to experiment with it, resulting in a few versions of the training set (produced by re-sampling, removing observations, etc.).

I want to pre-process the data, that is, to center, scale, and impute values (there is not much to impute). This is the point where I started to get confused.

I've been taught that you should always pre-process the test set in the same way you pre-processed the training set, that is, for centering and scaling, to estimate the mean and standard deviation on the training set and apply those values to the test set. This seems reasonable to me.
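For concreteness, here is a minimal sketch of that rule using caret's preProcess (train_df and test_df are hypothetical placeholder data frames): the centering, scaling, and imputation parameters are estimated on the training set only and then reused, unchanged, on the test set.

    library(caret)

    # Estimate centering, scaling, and (median) imputation parameters
    # on the training set only
    pp <- preProcess(train_df, method = c("center", "scale", "medianImpute"))

    # Apply the same transformation (training means/SDs) to both sets
    train_proc <- predict(pp, train_df)
    test_proc  <- predict(pp, test_df)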

But what should one do when the training set has been shrunk or resampled? Should one use the characteristics of the data that actually feeds the model (which is what the 'train' function in R's caret package suggests, since you can pass the pre-processing specification to it directly, as sketched below) and apply those to the test set, or should one capture the real characteristics of the data (from the whole, untouched training set) and apply those instead? If the second option is better, would it even be worth capturing the characteristics of the data by merging the training and test sets just for the pre-processing step, to get estimates that are as accurate as possible (I have never actually heard of anyone doing that, though)?
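To illustrate the first option, here is a sketch of the caret workflow the question alludes to (resampled_train_df, the Class outcome, and the random-forest model are all hypothetical placeholders): the pre-processing specification is passed to train(), so the parameters are estimated on the data that actually feeds each model fit, including within each resampling fold, and predict() then applies the stored transformation to new data automatically.

    library(caret)

    fit <- train(
      Class ~ ., data = resampled_train_df,
      method = "rf",                                   # any classifier works here
      preProcess = c("center", "scale", "medianImpute"),
      trControl = trainControl(method = "cv", number = 5)
    )

    # The stored pre-processing is applied to the test set before prediction
    preds <- predict(fit, newdata = test_df)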

I know I can simply test some of the approaches described here, and I certainly will, but are there any suggestions based on theory, or on your intuition or experience, for how to tackle this problem?

I also have an additional, optional question. Does it ever make sense to center but NOT scale the data (or the other way around)? Can anyone give an example where that approach would be reasonable?

Best Answer

If I understand your question, one of your ideas is to compute the normalization (centering and scaling) parameters across all of your data: both training and test.

Imagine that you take all of your data, calculate the centering and scaling once, and then use them on both the training set and the test set. Your training set represents the data you have now; your test set represents future data that you will not have when you are training. But you would somehow have magically calculated centering and scaling values that included this future data. That is A Leak From The Future (tm), which is bad.
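In code, the pattern to avoid looks like this (again with hypothetical train_df and test_df sharing the same columns): the normalization parameters are computed on the combined data, so the test set influences them; estimating them on the training set alone, as in the earlier snippet, avoids the leak.

    library(caret)

    # Leaky: the centering/scaling parameters see the test ("future") data
    pp_leaky <- preProcess(rbind(train_df, test_df), method = c("center", "scale"))

    # Correct: estimate on the training set only, then reuse on the test set
    pp_ok <- preProcess(train_df, method = c("center", "scale"))
    test_proc <- predict(pp_ok, test_df)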