Solved – Cross Validation with Preprocessing (Normalization, Discretization, Feature Selection)

data preprocessing, machine learning, validation, weka

I am now trying to evaluate my model with cross validation.
My dataset contains some numeric and nominal attributes.

Here, I carry out the following data preprocessing tasks:

A. Normalization: Min-Max Normalization (to [0,1])

B. Discretization: Supervised Discretization (Fayyad-Irani), creating bins with some supervised technique (using class label info)

C. Attribute Selection: Correlation based Feature Selection Method

At first, I applied the preprocessing to the entire dataset, obtaining a preprocessed and reduced dataset, and then evaluated my model on it with 10-fold cross validation.

However, I found that while this way is commonly (mis)used, it is not sound, because information from the test set leaks into the preprocessing (cheating on the test set).

Hence, I am now trying to do the preprocessing within cross validation. (Yes, preprocessing fitted on the training set of each fold only!)
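To make the per-fold idea concrete, here is a small Python sketch (plain Python rather than Weka, and only for illustration): the min-max parameters are learned from the training fold only and then applied to the held-out fold, and the same pattern would apply to the discretization and attribute-selection filters.

```python
import random

def kfold_splits(n_samples, k=10, seed=42):
    """Yield (train_idx, test_idx) index pairs for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

rng = random.Random(1)
values = [rng.uniform(0, 100) for _ in range(50)]  # toy numeric attribute

for train_idx, test_idx in kfold_splits(len(values), k=5):
    train = [values[i] for i in train_idx]
    lo, hi = min(train), max(train)  # fit min-max on the training fold ONLY
    norm_train = [(v - lo) / (hi - lo) for v in train]
    norm_test = [(values[i] - lo) / (hi - lo) for i in test_idx]  # may leave [0, 1]
    # ...then fit discretization and attribute selection on the training fold
    # (using its class labels only) and apply those fitted filters to the
    # test fold as well...
    assert all(0.0 <= v <= 1.0 for v in norm_train)
```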

Here, I have a question. As I understand it, the normalization filter fitted on the training set should also be applied to the test set. However, if the range of numeric values in the test set is not covered by the range in the training set, the normalized test values will not stay within [0, 1].
For example, if the training set ranges from 30 to 50 but the test set ranges from 10 to 100, the normalization filter fitted on the training set does not look appropriate for the test set.
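With these example numbers, a small sketch of the situation (an illustration, not Weka's actual Normalize filter) looks like this; clipping out-of-range test values back into [0, 1] is one possible workaround:

```python
def fit_minmax(train):
    """Learn min-max parameters from the training data only."""
    return min(train), max(train)

def apply_minmax(values, lo, hi, clip=False):
    """Apply a previously fitted min-max filter; optionally clip to [0, 1]."""
    scaled = [(v - lo) / (hi - lo) for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled

lo, hi = fit_minmax([30, 40, 50])                  # training range: 30..50
print(apply_minmax([10, 100], lo, hi))             # [-1.0, 3.5] -- outside [0, 1]
print(apply_minmax([10, 100], lo, hi, clip=True))  # [0.0, 1.0]
```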

What should I do in this situation?

(Plus) Is it acceptable to apply only normalization and discretization to the entire set, and do feature selection within the cross-validation loop?

Thank you in advance!
I look forward to receiving very helpful answers! 🙂

Best Answer

Doing preprocessing outside the cross-validation loop is especially bad when feature selection is performed (particularly when you have a large number of features), but much less so for data normalization: whether the values are scaled by 1 or by 100, the numbers already have a predetermined meaning, so there is little the model can cheat and learn about the left-out set.
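To illustrate why feature selection is the dangerous part, here is a small Python sketch (illustrative only, with made-up noise data): when attributes are scored on the full dataset before cross validation starts, selection bias alone will turn up a pure-noise attribute that appears predictive.

```python
import random

rng = random.Random(0)
n, p = 20, 500
labels = [i % 2 for i in range(n)]  # arbitrary binary class labels
# pure-noise binary attributes: none is truly related to the class
features = [[rng.randint(0, 1) for _ in range(n)] for _ in range(p)]

def score(feat):
    """Agreement of an attribute with the labels (counting the inverted rule too)."""
    acc = sum(f == y for f, y in zip(feat, labels)) / n
    return max(acc, 1.0 - acc)

# "cheating" selection: every sample, including the future test folds,
# is used to score the attributes before cross validation
best = max(score(f) for f in features)
print(best)  # far above the 0.5 chance level, purely by selection bias
```

Selecting inside each fold instead would let the held-out fold expose such an attribute as noise.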

If you do have a problem here, it reflects a programming defect more than a mathematical problem. One workaround is to make the lower and upper bounds of your bins wide enough to incorporate all of your data. That said, I don't think packages nowadays have this problem.
