Solved – Scaling the test set to [0,1] with parameters from the training set

machine-learning, normalization, r

I am working on a machine learning problem: I have a dataset with 50+ columns and 100,000 rows. I need to normalize the data by rescaling it to [0,1] (not by standardization), and I have split the dataset 80/20 into training and test sets.

My question is: I should first normalize the training set and then normalize the test set with the parameters extracted from the training set. How can I do that for each of the columns? That is, is there a defined method to get the (mean, deviation) tuple for every column of the training set, so that the test set can be normalized with those values?

Best Answer

Obtain the mean values and standard deviations of the training set, whatever that set happens to be, and apply those values to the test set: each test column is transformed as (x - mean_train) / sd_train (or, for [0,1] scaling, as (x - min_train) / (max_train - min_train)). The basic assumption of any machine learning method is that all the data comes from the same distribution, so ideally you should apply the same data-dependent transformations to the test set.

If you normalize both sets together you get a slightly better estimate of the normalization parameters, but you also introduce a small information leak: in a real application your final model could never have been built with parameters computed from unseen data.

In practice, though, there might not be much difference in performance, so you will see many people doing it.


How to do it in R:

##Let's create a 0.7 split from a dataset
data = iris
s = sample(1:nrow(data), 0.7*nrow(data))

train = data[s,]
test = data[-s,]

##Here we obtain the column means and sd, copied from the `mlr` tutorial
##http://mlr-org.github.io/mlr-tutorial/devel/html/preproc/index.html#writing-a-custom-preprocessing-wrapper
trainfun = function(data, target, args = list(center, scale)) {
  ## Identify numerical features
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ## Store the scaling parameters in control
  ## These are needed to preprocess the data before prediction
  control = args
  if (is.logical(control$center) && control$center)
    control$center = attr(x, "scaled:center")
  if (is.logical(control$scale) && control$scale)
    control$scale = attr(x, "scaled:scale")
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = control))
}

predictfun = function(data, target, args, control) {
  ## Identify numerical features
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]  
  data = cbind(data, as.data.frame(x))
  return(data)
}

train = trainfun(train, "Species", list(center = TRUE, scale = TRUE))

##So these are respectively your scaling parameters and your train data
control = train$control
train = train$data

##We apply it to the test set
test = predictfun(test, "Species", list(center = TRUE, scale = TRUE), control = control)

##Check the mean and sd of every column in train and test
cbind(train.mean = lapply(train[, -1], mean),
      train.sd   = lapply(train[, -1], sd),
      test.mean  = lapply(test[, -1], mean),
      test.sd    = lapply(test[, -1], sd))
             train.mean    train.sd test.mean    test.sd  
Sepal.Length 2.917201e-16  1        0.09530032   0.9208225
Sepal.Width  2.444938e-16  1        -0.005609797 0.8804698
Petal.Length -1.120497e-16 1        0.0977674    0.9124194
Petal.Width  1.08351e-16   1        0.09497264   0.9345891
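Note that the question asks for [0,1] min-max scaling rather than standardization (which is what scale() does above). The principle is identical: store the per-column minima and maxima of the training set instead of the means and standard deviations. A minimal base-R sketch, reusing the split indices s from above (names like train01 are just illustrative):

##Recreate the raw split (train/test were overwritten above)
nums = sapply(iris, is.numeric)
train_x = as.matrix(iris[s, nums])
test_x = as.matrix(iris[-s, nums])

##Per-column min and max from the TRAINING set only
mins = apply(train_x, 2, min)
maxs = apply(train_x, 2, max)

##Apply the same transformation to both sets
scale01 = function(x) sweep(sweep(x, 2, mins, "-"), 2, maxs - mins, "/")
train01 = scale01(train_x)
test01 = scale01(test_x)  ## test values may fall slightly outside [0,1]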

Also keep in mind that there are many frameworks in R, such as caret and mlr, that can do this automatically. And if your algorithm performs internal normalization (centering/scaling), it will be done without any intervention on your part.
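For example, with caret this takes two calls (a sketch, assuming the raw train/test data frames from the split at the top, before they were overwritten): preProcess() with method = "range" learns the per-column min/max on the training set, and predict() applies them to any data frame; non-numeric columns such as Species should pass through untouched.

library(caret)

##Learn the [0,1] rescaling parameters on the training set only
pp = preProcess(train, method = "range")

##Apply the same transformation to both sets
train01 = predict(pp, train)
test01 = predict(pp, test)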