Feature Scaling – Data Distribution and Feature Scaling Techniques in Machine Learning

dataset, distributions, machine learning, normalization, standardization

New to AI/ML. My understanding of feature scaling is that it's a set of techniques used to counteract the effects of different features having different scales/ranges (which would otherwise cause models to weight them incorrectly).

The two most common techniques here that I keep reading about are normalization (adjusting your feature values between 0 and 1) and standardization (adjusting your feature values to have a 0 mean and standard deviation of 1).
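For concreteness, here is a minimal sketch of what I mean by the two techniques (just my own toy illustration in NumPy):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])  # toy feature values

# Normalization (min-max scaling): squeeze values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-scoring): zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std()

print(x_norm)  # all values between 0 and 1
print(x_std)   # mean ~0, standard deviation ~1
```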

From what I can gather, normalization seems to work better when your data is non-Gaussian (not a "bell curve"), whereas standardization is better when it is Gaussian. But nowhere can I find a decent explanation of why this is the case!

Why does your data distribution affect the efficacy of your feature scaling technique? Why is normalization good for non-Gaussian data, whereas standardization is good for Gaussian data? Are there edge cases where you'd use standardization on non-Gaussian data? Are there any other major techniques besides these two?

For instance, I found this excellent paper on characterizing datasets by various distributions. So I'm wondering if there are methods for feature scaling when the data is, say, geometrically distributed, or when it's exponentially distributed, etc. And if so, what are they?!

Best Answer

I cannot speak in terms of machine learning, but I can speak in terms of scaling.

From our tag wiki:

tl;dr version first:

Normalization refers to scaling all numeric variables into the range [0,1], such as by using the formula: $$x_{new}=\frac{x-x_{min}}{x_{max}-x_{min}}$$

Standardization refers to transforming the data to have zero mean and unit variance, for example using the equation: $$x_{new}=\frac{x-\overline{x}}{s}$$

That is, normalization does not rely on the underlying distribution; standardization transforms the data based upon the parameters of a Gaussian distribution (its mean and variance).
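A quick sketch of both formulas in code (my addition; scikit-learn is assumed here purely for illustration, the tag wiki itself does not mention it):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [6.0], [10.0]])  # a single feature as a column

# Normalization: (x - x_min) / (x_max - x_min), mapped into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, giving zero mean and unit variance
X_zscored = StandardScaler().fit_transform(X)

print(X_minmax.ravel())   # roughly [0.    0.333 0.556 1.   ]
print(X_zscored.ravel())  # mean ~0, standard deviation ~1
```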

Fuller explanations:

"Normalization" refers to several related processes:

  • ("Feature scaling") A set of numbers whose maximum is $M$ and minimum is $m$ can be converted to the range from $0$ to $1$ by means of an affine transformation (which amounts to changing their units of measurement) $x \to (x-m)/(M-m)$.

  • A set of positive numbers $\{p_i\}$ representing probabilities or weights can be uniformly rescaled to sum to unity: divide each $p_i$ by the sum of all the $p_i$.

  • Analogously, a distribution (or indeed any non-negative function with a finite nonzero integral) can be normalized to have a unit integral by dividing its values by the integral.

  • A vector in a normed linear space is normalized (to unit length) by dividing it by its norm. This is a general procedure encompassing the two preceding operations as special examples.

The range from $0$ to $1$ can be extended to run from $0$ to any desired limit $\alpha$ by multiplying a previously unit-normalized value by $\alpha$.
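A small sketch of these other variants, i.e. unit-sum, unit-norm, and rescaling to $[0, \alpha]$ (my own illustration in NumPy, not part of the tag wiki):

```python
import numpy as np

# Unit-sum normalization: divide positive weights by their total so they sum to 1
p = np.array([2.0, 3.0, 5.0])
p_unit_sum = p / p.sum()           # [0.2, 0.3, 0.5]

# Unit-norm normalization: divide a vector by its (Euclidean) norm
v = np.array([3.0, 4.0])
v_unit = v / np.linalg.norm(v)     # [0.6, 0.8], length 1

# Rescaling a unit-normalized value to the range [0, alpha]
alpha = 10.0
x_unit = np.array([0.0, 0.25, 1.0])  # already scaled to [0, 1]
x_alpha = alpha * x_unit             # [0.0, 2.5, 10.0]
```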

Other kinds of operations exist having a similar intent of re-expressing values in a predetermined range. Many of these are nonlinear and tend to be used in specialized settings.
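As one example of such a nonlinear operation (my own choice, not one named in the wiki), a logistic/sigmoid squashing maps any real value into the open interval $(0, 1)$:

```python
import numpy as np

def sigmoid_scale(x):
    """Nonlinear squashing of arbitrary real values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid_scale(x))  # approximately [0.0067, 0.5, 0.9933]
```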

Standardization:

Shifting and rescaling data to assure zero mean and unit variance.

Specifically, when $(x_i), i=1, \ldots, n$ is a batch of data, its mean is $m=(\sum_i x_i)/n$ and its variance is $s^2=(\sum_i(x_i-m)^2)/\nu$, where $\nu$ is either $n$ or $n-1$ (choices vary with application). Standardization replaces each $x_i$ with $z_i = (x_i-m)/s$.
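In code, the two choices of $\nu$ correspond to NumPy's `ddof` argument (a sketch I am adding for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
m = x.mean()

# nu = n (population variance) vs nu = n - 1 (sample variance)
s_pop = x.std(ddof=0)     # 2.0 for this batch
s_sample = x.std(ddof=1)  # ~2.14 for this batch

z_pop = (x - m) / s_pop        # exactly zero mean and unit variance (ddof=0)
z_sample = (x - m) / s_sample  # zero mean, variance slightly below 1 (ddof=0)

print(z_pop.mean(), z_pop.std(ddof=0))  # ~0.0, 1.0
```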
