Neural Networks – Input Data Normalization and Centering

machine-learning, neural-networks, normalization

I'm learning Neural Networks and I grasped the algebra behind them. I'm now interested in understanding how normalization and centering of the input data affect them. In my personal learning project (Regression with NN) I've transformed my input variables to a range between 0 and 1 using the following function:

normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}  # linear rescale to [0, 1]

The NN model fits well and has an acceptable out-of-sample prediction error.

However, I've read in other questions that scaling the inputs to have mean 0 and variance 1 is advised for NNs. I don't fully understand:

  1. how this transformation works better for NNs than min-max normalization between 0 and 1, and
  2. how I can assess which transformation to apply to my data.
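
If I understand correctly, the mean-0, variance-1 version would be something like this (analogous to my function above):

    standardize <- function(x) {return((x - mean(x)) / sd(x))}  # mean 0, variance 1
    # base R's scale() does the same thing column-wise: scale(X)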

Best Answer

how this transformation works better for NNs than min-max normalization between 0 and 1

There isn't a hard-and-fast rule about which is better; this is context-dependent. For example, people training auto-encoders for MNIST commonly use $[0,1]$ scaling together with a variant of the log-loss (binary cross-entropy); that loss can't be combined with $z$ scaling, because it would require taking the logarithm of negative numbers, which doesn't yield a real value. Other problems may favor different scaling schemes for similarly idiosyncratic reasons.
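
A rough illustration of that incompatibility (an illustrative sketch, not taken from any particular implementation):

    bce <- function(target, pred) -mean(target * log(pred) + (1 - target) * log(1 - pred))
    x <- runif(10)                  # data already in [0, 1]
    z <- (x - mean(x)) / sd(x)      # z-scaled version: some values are negative
    bce(x, x)                       # finite, even for a perfect reconstruction
    bce(z, z)                       # NaN (with warnings): logs of negative numbers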

how I can assess which transformation to apply to my data

Scaling is important because it preconditions the data to make optimization easier. Putting the features on a common scale reshapes the optimization surface so that narrow valleys are less pronounced; such valleys make optimization, especially gradient descent, very difficult. A choice of scaling is "correct" to the extent that it makes optimization go more smoothly. A scaling method that produces values on both sides of zero, such as $z$ scaling or $[-1,1]$ scaling, is preferred (unless you're in a setting like the BCE-loss auto-encoder above). From the Neural Network FAQ:
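
As a concrete sketch of the $[-1,1]$ option (my own illustrative function, in the same style as the one in the question):

    scale_pm1 <- function(x) {2 * (x - min(x)) / (max(x) - min(x)) - 1}  # linear rescale to [-1, 1]
    # z scaling, for comparison: (x - mean(x)) / sd(x)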

But standardizing input variables can have far more important effects on initialization of the weights than simply avoiding saturation. Assume we have an MLP with one hidden layer applied to a classification problem and are therefore interested in the hyperplanes defined by each hidden unit. Each hyperplane is the locus of points where the net-input to the hidden unit is zero and is thus the classification boundary generated by that hidden unit considered in isolation. The connection weights from the inputs to a hidden unit determine the orientation of the hyperplane. The bias determines the distance of the hyperplane from the origin. If the bias terms are all small random numbers, then all the hyperplanes will pass close to the origin. Hence, if the data are not centered at the origin, the hyperplane may fail to pass through the data cloud. If all the inputs have a small coefficient of variation, it is quite possible that all the initial hyperplanes will miss the data entirely. With such a poor initialization, local minima are very likely to occur. It is therefore important to center the inputs to get good random initializations. In particular, scaling the inputs to $[-1,1]$ will work better than $[0,1]$, although any scaling that sets to zero the mean or median or other measure of central tendency is likely to be as good, and robust estimators of location and scale (Iglewicz, 1983) will be even better for input variables with extreme outliers.
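
A small simulation (my own sketch, not from the FAQ) makes the point concrete: with small random weights and biases, the initial hyperplanes rarely pass through an uncentered data cloud, but usually do once the inputs are centered.

    set.seed(1)
    n <- 200; h <- 50
    x <- matrix(runif(2 * n, min = 10, max = 12), ncol = 2)  # data cloud far from the origin
    w <- matrix(rnorm(2 * h, sd = 0.1), nrow = 2)            # small random input-to-hidden weights
    b <- rnorm(h, sd = 0.1)                                  # small random biases

    # fraction of hidden-unit hyperplanes whose net input changes sign over the data,
    # i.e. hyperplanes that actually pass through the data cloud
    frac_crossing <- function(x, w, b) {
      net <- sweep(x %*% w, 2, b, "+")
      mean(apply(net, 2, function(z) min(z) < 0 && max(z) > 0))
    }

    frac_crossing(x, w, b)                                       # low: most hyperplanes miss the cloud
    frac_crossing(scale(x, center = TRUE, scale = FALSE), w, b)  # much higher after centering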

A second benefit of scaling is that it can prevent units from saturating early in training. Sigmoid, tanh and softmax functions have horizontal asymptotes, so very large and very small inputs have small gradients. If training starts with these units at saturation, then optimization will proceed more slowly because the gradients are so shallow. (Effect of rescaling of inputs on loss for a simple neural network)
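
One quick way to see the saturation effect (an illustrative sketch): the derivative of the sigmoid collapses toward zero as the magnitude of its input grows.

    sigmoid      <- function(z) 1 / (1 + exp(-z))
    sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))  # derivative of the sigmoid
    sigmoid_grad(c(0, 2, 10, 50))  # roughly 0.25, 0.1, 4.5e-05, 1.9e-22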

Which scaling method works best depends on the problem, because different problems have different optimization surfaces. A very general strategy is to run an experiment: test how well the model works under each candidate scaling. This can be expensive, though, since the scaling interacts with other configuration choices, such as the learning rate, so in effect you'd be testing every model configuration under every scaling choice. Because that is tedious, it's typical to pick a simple method that works "well enough" for the problem at hand and focus on more interesting considerations.
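
A sketch of such an experiment follows; the data frame `dat`, its columns `x1`, `x2`, `y`, and the index vector `train` are placeholders, and `nnet` is just one convenient package for fitting a small regression network.

    library(nnet)

    scalers <- list(
      minmax = function(x) (x - min(x)) / (max(x) - min(x)),
      zscore = function(x) (x - mean(x)) / sd(x)
    )

    # `dat` holds predictors x1, x2 and response y; `train` indexes the training rows.
    # (For a careful comparison, estimate the scaling parameters on the training rows only.)
    for (name in names(scalers)) {
      d <- dat
      d[c("x1", "x2")] <- lapply(d[c("x1", "x2")], scalers[[name]])
      fit  <- nnet(y ~ x1 + x2, data = d[train, ], size = 5, linout = TRUE, trace = FALSE)
      rmse <- sqrt(mean((predict(fit, d[-train, ]) - d$y[-train])^2))
      cat(name, "held-out RMSE:", rmse, "\n")
    }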

Scaling using the min and max can be extremely sensitive to outliers: if even one value is orders of magnitude larger or smaller than the rest of the data, the denominator becomes very large, and the scaling clumps the rest of the data into a narrow segment of the $[0,1]$ or $[-1,1]$ interval.
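
For example (a quick sketch of the clumping effect):

    set.seed(1)
    x <- c(rnorm(99), 1000)                   # 99 ordinary values plus one extreme outlier
    x_mm <- (x - min(x)) / (max(x) - min(x))
    range(x_mm[1:99])                         # the 99 ordinary points are squeezed near 0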

A single large outlier also inflates the denominator of $z$ scaling, but the larger the sample size, the smaller that influence becomes. Methods based on the max and min, by contrast, are always strongly influenced by a single outlier. And as the FAQ quotation notes, robust estimators of location and scale are more effective still; unbiasedness isn't really a concern for this application.
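
As a sketch of that last point, a median/MAD-based scaling (one common robust choice; illustrative, not prescribed by the FAQ) keeps the bulk of the data spread out even when a single huge value is present:

    robust_scale <- function(x) (x - median(x)) / mad(x)  # mad() is rescaled to match sd for normal data
    set.seed(1)
    x <- c(rnorm(99), 1000)
    range(robust_scale(x)[1:99])  # the bulk keeps roughly its original spread
    range(scale(x)[1:99])         # z scaling squeezes the bulk into a narrow sliver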
