Solved – What are the benefits of L2 data normalization over other normalization techniques for deep learning

Tags: data-transformation, deep-learning, machine-learning, neural-networks, python

I've seen several ways to transform the data before feeding it into a neural network:

  1. Scaling it to the [0, 1] range by subtracting the min and dividing by (max - min);
  2. Centering it at zero with unit variance by subtracting the mean and dividing by the standard deviation (a sketch of both follows below).
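For concreteness, here is a minimal NumPy sketch of both schemes (the random array is just a hypothetical stand-in for a real training set):

    import numpy as np

    # Hypothetical stand-in for a training set: rows are samples.
    x = np.random.rand(1000, 784).astype("float32") * 255.0

    # 1. Min-max scaling to [0, 1], using the min/max of the whole set.
    x_minmax = (x - x.min()) / (x.max() - x.min())

    # 2. Standardization: zero mean, unit variance.
    x_standard = (x - x.mean()) / x.std()

    print(x_minmax.min(), x_minmax.max())       # ~0.0, ~1.0
    print(x_standard.mean(), x_standard.std())  # ~0.0, ~1.0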

The keras.utils package also has its own normalization function, normalize(x, axis=-1, order=2), but the rationale, results, and purpose of this kind of normalization are not clear to me. Moreover, the function can compute other kinds of norms as well: L1, L0, etc.
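As far as I can tell, with order=2 the function simply divides each slice along the chosen axis by its L2 norm. A rough NumPy equivalent (my own sketch, not the actual Keras source):

    import numpy as np

    def l2_normalize(x, axis=-1):
        # Divide each slice along `axis` by its L2 norm,
        # guarding against division by zero for all-zero slices.
        norm = np.linalg.norm(x, ord=2, axis=axis, keepdims=True)
        norm[norm == 0] = 1.0
        return x / norm

    x = np.array([[3.0, 4.0],
                  [0.0, 0.0]])
    print(l2_normalize(x))  # [[0.6, 0.8], [0.0, 0.0]]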

I've taken the MNIST dataset as an example. If we plot the data distribution for the first two kinds of normalization, here's what we get:

[Figure: left, distribution of the normalized data with zeros included; right, the same distribution without zeros]

The majority of the pixels have a value of 0. The statistics for the data are:

DescribeResult(nobs=47040000, minmax=(0.0, 1.0), mean=0.13066062, variance=0.09493039, skewness=2.151106357574463, kurtosis=2.9144181812310546)
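These numbers should be reproducible along these lines (assuming the training split of the standard Keras MNIST loader, so nobs = 60000 * 784 = 47,040,000):

    from keras.datasets import mnist
    from scipy.stats import describe

    (x_train, _), _ = mnist.load_data()
    x = x_train.astype("float32") / 255.0  # min-max scaling; raw pixels are 0..255
    print(describe(x.ravel()))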

But if we look at the plot of the L2-normalized data, it looks totally different:

[Figure: distribution of the L2-normalized data]

The statistics for L2-normalized data:

DescribeResult(nobs=47040000, minmax=(0.0, 0.27902707), mean=0.013792945, variance=0.0010852652, skewness=2.296393632888794, kurtosis=3.9849276836080234)
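The L2 statistics come from the same data, assuming each image is first flattened into a 784-dimensional vector:

    from keras.datasets import mnist
    from keras.utils import normalize
    from scipy.stats import describe

    (x_train, _), _ = mnist.load_data()
    x = x_train.reshape(-1, 784).astype("float32")  # one vector per image
    x_l2 = normalize(x, axis=-1, order=2)           # divide each image by its L2 norm
    print(describe(x_l2.ravel()))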

It looks like the data distribution has changed significantly.

I don't have enough background in linear algebra to understand what has happened or what L2 normalization actually does. So: is it appropriate to apply this kind of normalization to the input data, does it have any benefits, and why would one prefer it over the traditional approaches when training neural networks?

Best Answer

L2 normalization takes an input vector and converts it to a unit vector. So each MNIST image is interpreted as a 784-dimensional vector and is projected onto a point on the unit sphere in 784-dimensional space.
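A quick check of that claim (the random matrix is just a hypothetical batch of flattened images):

    import numpy as np
    from keras.utils import normalize

    x = np.random.rand(5, 784)
    x_hat = normalize(x, axis=-1, order=2)
    print(np.linalg.norm(x_hat, axis=-1))  # all ~1.0: every row lies on the unit sphere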

L2 normalization can be useful when you want to force learned embeddings to lie on a sphere, but I'm not sure this function is intended for a data-preprocessing scenario like the one you describe.

With the default axis, the function normalizes each data point separately, in contrast to most preprocessing scenarios, where you use statistics of the entire training set (such as the max/min) to normalize. It is also prone to losing information about the scale of the inputs: you can no longer tell the data point (3, 5) apart from the data point (6, 10), since both map to the same unit vector (demonstrated below).
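To make the loss of scale concrete:

    import numpy as np
    from keras.utils import normalize

    a = np.array([[3.0, 5.0]])
    b = np.array([[6.0, 10.0]])  # same direction, twice the magnitude
    print(normalize(a, axis=-1, order=2))  # [[0.5145, 0.8575]]
    print(normalize(b, axis=-1, order=2))  # identical output: the scale is discarded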

If used on axis 0, the normalization would change depending on how many data points you had, which doesn't make much sense to me.
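A small sketch of that dependence:

    import numpy as np
    from keras.utils import normalize

    x = np.array([[3.0, 4.0],
                  [3.0, 4.0]])
    print(normalize(x, axis=0, order=2))       # each column divided by its norm over all rows

    x_more = np.vstack([x, x])                 # the same data, just duplicated
    print(normalize(x_more, axis=0, order=2))  # the values shrink as rows are added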