Neural Networks – Input Data Normalization and Centering

machine-learning, neural-networks, normalization

I'm learning Neural Networks and I grasped the algebra behind them. I'm now interested in understanding how normalization and centering of the input data affect them. In my personal learning project (Regression with NN) I've transformed my input variables to a range between 0 and 1 using the following function:

normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}  # linear rescale to [0, 1]

The NN model fits well and has an acceptable out-of-sample prediction error.

However, I've read in other questions that scaling the inputs to have mean 0 and variance 1 is advised for NNs. I don't fully understand:

  1. how this transformation works better for NNs than min-max normalization between 0 and 1, and
  2. how I can assess which transformation to apply to my data.
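
If I understand correctly, the mean-0, variance-1 version would be something like this (analogous to my function above):

    standardize <- function(x) {return((x - mean(x)) / sd(x))}  # mean 0, variance 1
    # base R's scale() does the same thing column-wise: scale(X)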

Best Answer

how this transformation works better for NNs than min-max normalization between 0 and 1

There isn't a hard-and-fast rule about which is better; this is context-dependent. For example, people training auto-encoders for MNIST commonly use $[0,1]$ scaling together with a variant of the log-loss (binary cross-entropy); that loss can't be combined with $z$ scaling, because it would require taking the logarithm of negative numbers, which doesn't yield a real value. Other problems may favor different scaling schemes for similarly idiosyncratic reasons.
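
A rough illustration of that incompatibility (an illustrative sketch, not taken from any particular implementation):

    bce <- function(target, pred) -mean(target * log(pred) + (1 - target) * log(1 - pred))
    x <- runif(10)                  # data already in [0, 1]
    z <- (x - mean(x)) / sd(x)      # z-scaled version: some values are negative
    bce(x, x)                       # finite, even for a perfect reconstruction
    bce(z, z)                       # NaN (with warnings): logs of negative numbers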

how I can assess which transformation to apply to my data

Scaling is important because it preconditions the data to make optimization easier. Putting the features on a common scale reshapes the optimization surface so that narrow valleys are less pronounced; such valleys make optimization, especially gradient descent, very difficult. A choice of scaling is "correct" to the extent that it makes optimization go more smoothly. A scaling method that produces values on both sides of zero, such as $z$ scaling or $[-1,1]$ scaling, is preferred (unless you're in a setting like the BCE-loss auto-encoder above). From the Neural Network FAQ:
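
As a concrete sketch of the $[-1,1]$ option (my own illustrative function, in the same style as the one in the question):

    scale_pm1 <- function(x) {2 * (x - min(x)) / (max(x) - min(x)) - 1}  # linear rescale to [-1, 1]
    # z scaling, for comparison: (x - mean(x)) / sd(x)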

But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. Assume we have an MLP with one hidden layer applied to a classification problem and are therefore interested in the hyperplanes defined by each hidden unit. Each hyperplane is the locus of points where the net-input to the hidden unit is zero and is thus the classification boundary generated by that hidden unit considered in isolation. The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to $[-1,1]$ will work better than $[0,1]$, although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.
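
A small simulation (my own sketch, not from the FAQ) makes the point concrete: with small random weights and biases, the initial hyperplanes rarely pass through an uncentered data cloud, but usually do once the inputs are centered.

    set.seed(1)
    n <- 200; h <- 50
    x <- matrix(runif(2 * n, min = 10, max = 12), ncol = 2)  # data cloud far from the origin
    w <- matrix(rnorm(2 * h, sd = 0.1), nrow = 2)            # small random input-to-hidden weights
    b <- rnorm(h, sd = 0.1)                                  # small random biases

    # fraction of hidden-unit hyperplanes whose net input changes sign over the data,
    # i.e. hyperplanes that actually pass through the data cloud
    frac_crossing <- function(x, w, b) {
      net <- sweep(x %*% w, 2, b, "+")
      mean(apply(net, 2, function(z) min(z) < 0 && max(z) > 0))
    }

    frac_crossing(x, w, b)                                       # low: most hyperplanes miss the cloud
    frac_crossing(scale(x, center = TRUE, scale = FALSE), w, b)  # much higher after centering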

A second benefit of scaling is that it can prevent units from saturating early in training. Sigmoid, tanh and softmax functions have horizontal asymptotes, so very large and very small inputs have small gradients. If training starts with these units at saturation, then optimization will proceed more slowly because the gradients are so shallow. (Effect of rescaling of inputs on loss for a simple neural network)
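
One quick way to see the saturation effect (an illustrative sketch): the derivative of the sigmoid collapses toward zero as the magnitude of its input grows.

    sigmoid      <- function(z) 1 / (1 + exp(-z))
    sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))  # derivative of the sigmoid
    sigmoid_grad(c(0, 2, 10, 50))  # roughly 0.25, 0.1, 4.5e-05, 1.9e-22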

Which scaling method works best depends on the problem, because different problems have different optimization surfaces. A very general strategy is to run an experiment: test how well the model works under each candidate scaling. This can be expensive, though, since the scaling interacts with other configuration choices, such as the learning rate, so in effect you'd be testing every model configuration under every scaling choice. Because that is tedious, it's typical to pick a simple method that works "well enough" for the problem at hand and focus on more interesting considerations.
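
A sketch of such an experiment follows; the data frame `dat`, its columns `x1`, `x2`, `y`, and the index vector `train` are placeholders, and `nnet` is just one convenient package for fitting a small regression network.

    library(nnet)

    scalers <- list(
      minmax = function(x) (x - min(x)) / (max(x) - min(x)),
      zscore = function(x) (x - mean(x)) / sd(x)
    )

    # `dat` holds predictors x1, x2 and response y; `train` indexes the training rows.
    # (For a careful comparison, estimate the scaling parameters on the training rows only.)
    for (name in names(scalers)) {
      d <- dat
      d[c("x1", "x2")] <- lapply(d[c("x1", "x2")], scalers[[name]])
      fit  <- nnet(y ~ x1 + x2, data = d[train, ], size = 5, linout = TRUE, trace = FALSE)
      rmse <- sqrt(mean((predict(fit, d[-train, ]) - d$y[-train])^2))
      cat(name, "held-out RMSE:", rmse, "\n")
    }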

Scaling using the min and max can be extremely sensitive to outliers: if even one value is orders of magnitude larger or smaller than the rest of the data, the denominator becomes very large, and the scaling clumps the rest of the data into a narrow segment of the $[0,1]$ or $[-1,1]$ interval.
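
For example (a quick sketch of the clumping effect):

    set.seed(1)
    x <- c(rnorm(99), 1000)                   # 99 ordinary values plus one extreme outlier
    x_mm <- (x - min(x)) / (max(x) - min(x))
    range(x_mm[1:99])                         # the 99 ordinary points are squeezed near 0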

A single large outlier also inflates the denominator of $z$ scaling, but the larger the sample size, the smaller that influence becomes. Methods based on the max and min, by contrast, are always strongly influenced by a single outlier. And as the FAQ quotation notes, robust estimators of location and scale are more effective still; unbiasedness isn't really a concern for this application.
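
As a sketch of that last point, a median/MAD-based scaling (one common robust choice; illustrative, not prescribed by the FAQ) keeps the bulk of the data spread out even when a single huge value is present:

    robust_scale <- function(x) (x - median(x)) / mad(x)  # mad() is rescaled to match sd for normal data
    set.seed(1)
    x <- c(rnorm(99), 1000)
    range(robust_scale(x)[1:99])  # the bulk keeps roughly its original spread
    range(scale(x)[1:99])         # z scaling squeezes the bulk into a narrow sliver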
