I am working on a machine learning problem with a dataset of 50+ columns and 100,000 rows. I need to normalize the data to the range [0,1] (not standardize it), and I've split the dataset 80/20 into training and test sets.
My understanding is that I must first normalize the training set and then normalize the test set with the means and deviations extracted from the training set normalization. How can I do that for each of the columns? I mean, is there a defined method to get the (mean, deviation) tuples for every column in the training set so that I can normalize the test set with those values?
Best Answer
Obtain the normalization parameters from the training set, whatever they may be (means and standard deviations, or minima and maxima for [0,1] scaling), and apply those values to the test set. The basic assumption of any machine learning method is that all the data come from the same distribution, so ideally you should apply the same data-dependent transformations to the test set.
If you normalize both sets together you get a better estimate of the normalization parameters, but you also introduce a small information leak: in deployment, your final model could never have been based on parameters computed from unseen data.
In practice, though, there might not be much difference in performance, so you will see many people doing it.
How to do it in R:
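A minimal sketch in base R, assuming a data frame whose columns are all numeric. The toy data and the names `df`, `train`, `test`, `train_min`, and `train_range` are illustrative stand-ins, not taken from the question. It uses `scale()`, which subtracts `center` and divides by `scale` column by column, so passing the training minima and ranges gives [0,1] scaling.

```r
set.seed(42)

# Hypothetical toy data standing in for the real 50+ column dataset
df <- data.frame(a = runif(100, 0, 10),
                 b = rnorm(100, mean = 5, sd = 2),
                 c = rexp(100))

# 80/20 split into training and test sets
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Per-column parameters computed from the training set only
train_min   <- sapply(train, min)
train_max   <- sapply(train, max)
train_range <- train_max - train_min  # a constant column would give 0 here

# Apply the same training-set parameters to both sets
train_norm <- as.data.frame(scale(train, center = train_min, scale = train_range))
test_norm  <- as.data.frame(scale(test,  center = train_min, scale = train_range))

# For standardization instead, the (mean, deviation) pairs would be:
# train_mean <- sapply(train, mean); train_sd <- sapply(train, sd)
# scale(test, center = train_mean, scale = train_sd)
```

The training columns end up exactly in [0,1]. Test values can fall slightly outside that range whenever a test observation is smaller than the training minimum or larger than the training maximum, which is expected with this approach.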
Also keep in mind that there are many frameworks in R that can do this automatically, like caret and mlr. Also, if your algorithm includes internal normalization (center/scale), this will be done without your intervention.