Solved – Why do we normalize test data on the parameters of the training data

feature-scaling, machine-learning, normalization

I just built a toy linear regression model with gradient descent, coding it from scratch. It was doing fine on the training data, but its predictions were off on the test data. In the end I figured out that I was normalizing new data according to its own mean and range, instead of using the mean and range of the training data.

And I realized that I never understood why this doesn't work, and I never found an explainer. Intuitively, I see normalization as a way of "rewriting" the data without changing its structure. When I get the test data, I can easily calculate its mean, range, and standard deviation, so shouldn't I "rewrite" it in terms of itself? Doing it with the statistics of the training data also feels a bit like cheating, since we typically avoid any contact between the training set and the test set.
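For concreteness, here is a simplified sketch of what I was doing (the data here is synthetic and the variable names are made up; my actual model is a plain linear regression trained with gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 2))  # stand-ins for my real features
X_test = rng.normal(size=(20, 2))

# What I was doing: scaling each split by its OWN mean and range.
X_train_scaled = (X_train - X_train.mean(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
X_test_scaled = (X_test - X_test.mean(axis=0)) / (X_test.max(axis=0) - X_test.min(axis=0))

# The two splits end up on slightly different scales, so the weights
# learned on X_train_scaled don't quite apply to X_test_scaled.
```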

Best Answer

You are supposed to use the parameters from the training data set to standardize the test data.

This is because in real life we only ever have access to some subset of the total population of data. When we deploy a data-driven (statistical) model, it needs to handle new, unseen data. Since we do not have access to that new data yet, we cannot realistically gather some large "group" of it to compute a separate standardization. The test data is meant to simulate this part of reality: there is data out there that we don't have access to, and eventually our model needs to process it based on "what it knows", i.e. the statistics of the data it was trained on.
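As a minimal sketch (assuming plain mean/standard-deviation standardization on NumPy arrays; sklearn's `StandardScaler` encodes the same pattern via `fit` on the training set and `transform` on everything else), the parameters are computed once from the training data and then reused:

```python
import numpy as np

def fit_scaler(X):
    """Compute standardization parameters from the training data only."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return mu, sigma

def apply_scaler(X, mu, sigma):
    """Standardize any data (train, test, or future production data)
    with the parameters learned from the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(25, 3))

mu, sigma = fit_scaler(X_train)                 # "what the model knows"
X_train_std = apply_scaler(X_train, mu, sigma)
X_test_std = apply_scaler(X_test, mu, sigma)    # same parameters, no peeking
```

At deployment time you would persist `mu` and `sigma` alongside the model weights, since every new observation has to be scaled with exactly these values before it is fed to the model.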

Sure, perhaps we could always "wait" for more data to arrive, then retrain the model and update the statistics, but that is often expensive. Therefore we usually train once (or once in a long while) on whatever data is available at the start, and we estimate the generalization error and the model's adaptability by making sure we never touch the test data during model building and data pre-processing.

Also, as brought up in a comment above, it is important that the training and test data come from (are reflective of) the same distribution; this is an assumption we make in our modelling.
