Machine Learning Rescaling Techniques – Is L2 Normalization the Same as Mean-Centering and Unit Variance?

machine-learning, neural-networks, normal-distribution, normalization

I'm following this guide on detecting anomalies using autoencoders. The section titled "Normalising & Standardising" seems to describe normalization as scaling and shifting the features so that they are centered around 0 with a standard deviation of 1. But the implementation in the pipeline uses sklearn.preprocessing.Normalizer, which, if I understand correctly, applies the L2 vector norm, scaling your features to have an L2 norm of 1.

Are these two somehow the same? Are there different "normalization" methods? If so, which should be used when?

Best Answer

The Kaggle post describes a different procedure than the code carries out. What the author is trying (but not succeeding) to say is that preconditioning can improve gradient-based optimization. Here's an answer that explains this more precisely: https://stats.stackexchange.com/a/437848/22311

Suppose we have some matrix $X$ where the rows store observations (examples) and the columns store features (the measurements you collect for each example).

sklearn.preprocessing.Normalizer rescales the feature vector for each observation. So if observation $i$ has feature vector $x_i$, then after applying sklearn.preprocessing.Normalizer, we have $\| x_i \|=1 ~ \forall i$. In other words, all of the rows of $X$ have the same length. This is why all of the data points fall along a clean curve in the sklearn plot: all of the plotted points are the same distance from the origin.

[Figure: the sklearn preprocessing plot referenced above; after Normalizer, every point lies at distance 1 from the origin.]
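As a quick sanity check (my own sketch, not from the Kaggle post or sklearn docs), the row-wise behavior is easy to verify directly:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# Normalizer rescales each *row* to unit L2 norm.
X_rows = Normalizer(norm='l2').fit_transform(X)
print(X_rows)                          # [[0.6 0.8], [0.7071... 0.7071...]]
print(np.linalg.norm(X_rows, axis=1))  # [1. 1.]: every row has length 1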

But sklearn.preprocessing.Normalizer is different from the "normalization" and "standardization" usages that OP describes. Indeed, most usages of "normalizing" and "standardizing" are consistent with what OP describes in their question. Usually, "normalizing" and "standardizing" are about rescaling the features themselves. In other words, these are operations that scale and shift the columns of $X$, as described in What's the difference between Normalization and Standardization? This question has answers about when to use these methods: When to Normalization and Standardization?
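By contrast, the column-wise behavior can be checked the same way. Here is a minimal sketch (mine, using made-up data) confirming that StandardScaler operates on the columns:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -5.0], scale=[2.0, 0.5], size=(1000, 2))

# StandardScaler shifts and scales each *column* to mean 0 and standard deviation 1.
X_cols = StandardScaler().fit_transform(X)
print(X_cols.mean(axis=0))  # approximately [0. 0.]
print(X_cols.std(axis=0))   # [1. 1.] up to floating point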

Intuitively, we would not expect composing L2 row scaling with min/max column scaling to be the same, in general, as scaling the columns to have mean 0 and unit variance. This is because L2 row scaling makes each value in a row depend on all the other values in that row. On the other hand, $z$-scores are computed from the columns alone.
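To make that dependence concrete, here is a small illustration (my own, not from the linked posts): changing a single entry of a row changes every normalized value in that row.

import numpy as np
from sklearn.preprocessing import Normalizer

a = np.array([[3.0, 4.0]])
b = np.array([[3.0, 40.0]])  # only the second entry differs

# The first entry's normalized value changes too, because the row's norm changed.
print(Normalizer().fit_transform(a))  # [[0.6 0.8]]
print(Normalizer().fit_transform(b))  # [[0.0747... 0.9972...]]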

A direct demonstration is to apply the two transformations to the same data and compare the results.

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

# Draw from the seeded generator itself so the example is reproducible.
prng = np.random.default_rng(12345)
X = prng.multivariate_normal(mean=[-2, -1, 0, 1, 2],
                             cov=np.eye(5) * np.array([0.1, 0.3, 0.5, 0.7, 0.9]),
                             size=10)

# Pipeline 1: L2 row scaling followed by min/max column scaling.
pipe1 = Pipeline([('normalizer', Normalizer()),
                  ('scaler', MinMaxScaler())])

# Pipeline 2: column-wise z-scoring.
pipe2 = Pipeline([('standard', StandardScaler())])

X1 = pipe1.fit_transform(X)
X2 = pipe2.fit_transform(X)

print(X1 - X2)
assert np.allclose(X1, X2)  # fails: the two results differ

The code raises an AssertionError because the transformations are not identical: X1 is different from X2, and the sizes of the differences can be very large!


Why does the author of the Kaggle post make this error?

The Kaggle text quotes Giorgos Myrianthous's answer to this Stack Overflow question that describes centering and scaling the data, which is close to what the StandardScaler does. For some reason, the Stack Overflow post uses Normalizer instead of StandardScaler. Apparently, neither Giorgos Myrianthous nor the Kaggle author bothered to read the documentation to determine which function applies centering by the mean and scaling by the standard deviation.

Also, Giorgos Myrianthous's answer describes rescaling by the variance for some reason. That doesn't make much sense, because the variance is measured in squared units; StandardScaler rescales by the standard deviation, which is measured in the same units as the data. Moreover, dividing a non-constant random variable by its standard deviation gives a random variable with variance 1. Dividing by the variance does not, unless the variance is already 1.
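To spell out the one-line calculation (my addition, for completeness): if $\operatorname{Var}(X) = \sigma^2 > 0$, then

$$\operatorname{Var}\left(\frac{X}{\sigma}\right) = \frac{\operatorname{Var}(X)}{\sigma^2} = 1 \qquad \text{but} \qquad \operatorname{Var}\left(\frac{X}{\sigma^2}\right) = \frac{\operatorname{Var}(X)}{\sigma^4} = \frac{1}{\sigma^2},$$

which equals 1 only when $\sigma^2 = 1$.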

I've demonstrated several ways in which Giorgos Myrianthous's answer is misleading in https://stackoverflow.com/a/71887356/2482661.