Obtain the means and standard deviations from the training set, whatever it may be, and apply those values to the test set. The basic assumption of any machine learning method is that all the data comes from the same distribution, and ideally you should apply the same data-dependent transformations to the test set.
If you normalize both sets together you get a slightly better estimate of the normalization parameters, but you also introduce a small information leak: in deployment, your final model could never have been fit with parameters computed from unseen data.
In practice, though, there may not be much difference in performance, so you will see many people doing it.
How to do it in R:
##Let's create a 0.7 split from a dataset
data = iris
s = sample(1:nrow(data), floor(0.7 * nrow(data)))  ## call set.seed() first for a reproducible split
train = data[s,]
test = data[-s,]
##Here we obtain the column means and sd, copied from the `mlr` tutorial
##http://mlr-org.github.io/mlr-tutorial/devel/html/preproc/index.html#writing-a-custom-preprocessing-wrapper
trainfun = function(data, target, args = list(center, scale)) {
  ## Identify numerical features
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ## Store the scaling parameters in control
  ## These are needed to preprocess the data before prediction
  control = args
  if (is.logical(control$center) && control$center)
    control$center = attr(x, "scaled:center")
  if (is.logical(control$scale) && control$scale)
    control$scale = attr(x, "scaled:scale")
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = control))
}
predictfun = function(data, target, args, control) {
  ## Identify numerical features
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}
train = trainfun(train, "Species", list(center = TRUE, scale = TRUE))
##So these are respectively your scaling parameters and your train data
control = train$control
train = train$data
##We apply it to the test set
test = predictfun(test, "Species", list(center = TRUE, scale = TRUE), control = control)
##Check the mean and sd of every column in train and test
cbind(train.mean = lapply(train[, -1], mean),
      train.sd = lapply(train[, -1], sd),
      test.mean = lapply(test[, -1], mean),
      test.sd = lapply(test[, -1], sd))
train.mean train.sd test.mean test.sd
Sepal.Length 2.917201e-16 1 0.09530032 0.9208225
Sepal.Width 2.444938e-16 1 -0.005609797 0.8804698
Petal.Length -1.120497e-16 1 0.0977674 0.9124194
Petal.Width 1.08351e-16 1 0.09497264 0.9345891
Also keep in mind that there are many frameworks in R that can do this automatically, like caret and mlr. And if your algorithm performs internal normalization (centering/scaling), it will happen without any intervention on your part.
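For example, here is a minimal sketch with caret (assuming the caret package is installed; the variable names raw_train and raw_test are ours): preProcess() learns the centering/scaling parameters from the training set only, and predict() applies those same parameters to any data set.
library(caret)
## Re-split the raw iris data for this sketch
set.seed(1)
s = sample(1:nrow(iris), floor(0.7 * nrow(iris)))
raw_train = iris[s, ]
raw_test = iris[-s, ]
## Learn center/scale parameters on the training set only
## (non-numeric columns like Species are passed through untouched)
pp = preProcess(raw_train, method = c("center", "scale"))
## Apply the training parameters to both sets
train_scaled = predict(pp, raw_train)
test_scaled = predict(pp, raw_test)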
Best Answer
You are supposed to use the parameters from the training data set to standardize the test data.
This is because in real life we only ever have access to some subset of the total population of data. When we deploy a data-driven (statistical) model it needs to adapt to new, unseen data. Since we do not have access to this new, unseen data, we cannot realistically hold out some large "group" of it to do a separate standardization on. The test data is meant to simulate this part of our reality somewhat: there is data out there that we do not have access to, and eventually our model needs to process it based on "what it knows".
Sure, we could always "wait" for more data to arrive and then re-train models and update statistics, but that is often expensive. Therefore we usually train once (in a long while), depending on what data we have available at the start, and try to estimate generalization error and our model's adaptability by ensuring we never touch the test data during model building and data pre-processing.
Also, as brought up in a comment above, it is important that the training and test sets come from (are reflective of) the same distribution - this is an assumption we place on our modelling.
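To make this concrete, here is a minimal base-R sketch of the same idea (raw_train and raw_test are hypothetical numeric data frames, e.g. the four iris measurement columns split as in the other answer):
## Compute the standardization parameters on the training set only
mu = colMeans(raw_train)
sigma = apply(raw_train, 2, sd)
## Standardize both sets with the *training* parameters
train_std = scale(raw_train, center = mu, scale = sigma)
test_std = scale(raw_test, center = mu, scale = sigma)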