I know you've got your answer but I want to clarify something...
- This is a case of reversed-scale min-max normalization.
That means the best value is 21.07 and the worst value is 100 (in your case).
Here you should use:
$$
x_{normalized} = \frac{max(x)-x_i}{max(x)-min(x)}
$$
Example:
If you're normalizing $x_i = 99$, the result should be close to 0.
$$
x_{normalized} = \frac{100-x_i}{100-21.07}=\frac{100-99}{78.93}=0.013
$$
- In most cases, the formula used for min-max scaling is (see the code sketch after this list for both variants):
$$
x_{normalized} = \frac{x_i-min(x)}{max(x)-min(x)}
$$
Example:
If you're normalizing $x_i = 99$, the result should be close to 1.
Big values produce big results.
$$
x_{normalized} = \frac{x_i-21.07}{100-21.07}=\frac{99-21.07}{78.93}=0.987
$$
In both cases:
$max(x)$ represents the maximum value of the entire population (measurements),
$min(x)$ represents the minimum value found in the entire population (measurements).
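A minimal NumPy sketch of both variants, using made-up measurements that include the 21.07 and 100 endpoints from the question:

```python
import numpy as np

x = np.array([21.07, 30.0, 55.5, 99.0, 100.0])  # made-up measurements

# Standard min-max scaling: big values map close to 1.
standard = (x - x.min()) / (x.max() - x.min())

# Reversed min-max scaling: big values map close to 0.
reversed_scale = (x.max() - x) / (x.max() - x.min())

print(standard.round(3))        # [0.    0.113 0.436 0.987 1.   ]
print(reversed_scale.round(3))  # [1.    0.887 0.564 0.013 0.   ]
```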
Sidenote - a more generalized approach
There is a common issue in MOGA (Multi-Objective Genetic Algorithm) optimization where the algorithm can only minimize the objective function $f(x,y)$. Sometimes we want to switch from minimization to maximization of the objective function. We can do that in two ways:
Reverse the scale with an unknown range. If you don't know the range of values, just slap on a $(-1)$ multiplication:
$$
f_{new}(x,y) = (-1)*f(x,y)
$$
Example using our normalization formula:
$$
(-1)*x_{normalized} = (-1)*0.987 = -0.987
$$
The values are negative (and harder for humans to interpret) but are in the correct order of importance.
Big values like $x_i = 99$ get more negative (smaller) results: $-0.987$.
Small values like $x_i = 30$ get less negative (bigger) results: $-0.113$.
Reversed scale just like you want.
Easy to interpret for computers:
$$x_{normalized}(99)<x_{normalized}(30)$$ $$-0.987 < -0.113$$
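A minimal sketch of the sign-flip trick on the same made-up values:

```python
import numpy as np

x = np.array([21.07, 30.0, 99.0, 100.0])
x_norm = (x - x.min()) / (x.max() - x.min())

# Multiplying by -1 reverses the order of importance:
# the biggest original value now has the smallest score.
flipped = -x_norm
print(flipped.round(3))  # approx. [-0., -0.113, -0.987, -1.]
```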
Reverse the scale with a known range. If you know your range of function values is 0-1 (a normalized range), then $max_{range} = 1$ for our function.
Example using our normalization formula:
$$
max_{range}-x_{normalized} = 1-x_{normalized} = 1-0.987 = 0.013
$$
The advantage of this formula is that you keep the 0-1 range.
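A sketch of the known-range variant; note that $1 - x_{normalized}$ is algebraically the same as the reversed min-max formula at the top of this answer:

```python
import numpy as np

x = np.array([21.07, 30.0, 99.0, 100.0])
x_norm = (x - x.min()) / (x.max() - x.min())

# Subtracting from the known maximum of the range (1 here)
# flips the order while keeping everything inside [0, 1].
flipped = 1 - x_norm
print(flipped.round(3))  # [1.    0.887 0.013 0.   ]
```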
Kudos to whuber for providing the shortest answer with a useful formula.
Hope my answer sheds some light on why to use one formula or the other, and how each works, for future users facing these problems.
All the best from Ro
TL;DR It is pointless to use both transformations.
Say that $X$ is your data. What you are trying to do is
$$
z = \frac{x-\mathrm{mean}(X)}{\mathrm{sd}(X)}, \qquad
y = \frac{z-\min(Z)}{\max(Z)-\min(Z)}
$$
Let us use the symbols $m,s,l,u$ for the sample mean, standard deviation, minimum, and maximum, respectively. Notice that after $z$-transforming, the minimum and maximum also get $z$-transformed, so $\min(Z) = \frac{l - m}{s}$ etc. Now, if we combine both equations, we have
$$
\require{cancel}
\frac{\frac{x - m}{s} - \frac{l-m}{s}}{\frac{u-m}{s} - \frac{l-m}{s}} =
\frac{\frac{x - \cancel{m} - (l - \cancel{m})}{s}}{\frac{u -\cancel{m}-(l-\cancel{m})}{s}} =
\frac{\frac{x - l}{s}}{\frac{u-l}{s}} =
\frac{x - l}{u-l}
$$
So basically, applying the $z$-transformation and then min-max scaling leads to the same result as min-max scaling alone. The same can be shown for min-max scaling followed by the $z$-transformation, which gives the same result as the $z$-transformation alone.
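A quick numerical check of both claims, sketched in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

def minmax(v):
    return (v - v.min()) / (v.max() - v.min())

def zscore(v):
    return (v - v.mean()) / v.std()

# z-transform followed by min-max scaling equals min-max scaling alone...
np.testing.assert_allclose(minmax(zscore(x)), minmax(x))
# ...and min-max scaling followed by z-transform equals z-transform alone.
np.testing.assert_allclose(zscore(minmax(x)), zscore(x))
```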
See also Transform data to have specific mean, minimum and maximum?.
Best Answer
The Kaggle post describes a different procedure than the code carries out. What the author is trying (but not succeeding) to say is that preconditioning can improve gradient-based optimization. Here's an answer that explains this in a more precise way. https://stats.stackexchange.com/a/437848/22311
Suppose we have some matrix $X$ where the rows store observations (examples) and the columns store features (the measurements you collect for each example).
`sklearn.preprocessing.Normalizer` rescales the feature vector for each observation. So if an observation $i$ has feature vector $x_i$, then after applying `sklearn.preprocessing.Normalizer`, we have $\| x_i \|=1 ~ \forall i$. In other words, all of the rows of $X$ have the same length. This is why all of the data points fall along a clean curve in the `sklearn` plot: all of the plotted points are the same distance from the origin.
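A minimal sketch verifying the unit row norms on random made-up data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 observations, 3 features

Xn = Normalizer().fit_transform(X)  # L2 norm by default

# Every row of the transformed matrix has unit length.
print(np.linalg.norm(Xn, axis=1))  # [1. 1. 1. 1. 1.]
```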
But the `sklearn.preprocessing.Normalizer` is different from the "normalization" and "standardization" usages that OP describes. Indeed, most usages of "normalizing" and "standardizing" are consistent with what OP describes in their question. Usually, "normalizing" and "standardizing" are about rescaling the features themselves. In other words, these are operations that apply scaling and shifting to the columns of $X$, as described in What's the difference between Normalization and Standardization? This question has answers about when to use these methods: When to Normalization and Standardization?

Intuitively, we would not expect composing L2 row scaling and min/max scaling to be the same as scaling the columns to have 0 mean and unit variance in general. This is because L2 row scaling makes the values in each row depend on all other values in the row. On the other hand, $z$-scores are applied to the columns alone.
A direct demonstration for this is to just apply the two transformations to the same data and compare the result.
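Here is a minimal sketch of such a demonstration; the `X1`/`X2` names mirror the identifiers quoted below, and the data is made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# L2 row scaling followed by min/max column scaling...
X1 = MinMaxScaler().fit_transform(Normalizer().fit_transform(X))
# ...versus centering and scaling the columns directly.
X2 = StandardScaler().fit_transform(X)

# Raises AssertionError because the two results differ.
np.testing.assert_allclose(X1, X2)
```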
The code raises an exception because the transformations are not identical: `X1` is different from `X2`, and the size of the differences can be very large!

Why does the author of the Kaggle post make this error?
The Kaggle text quotes Giorgos Myrianthous's answer to this Stack Overflow question that describes centering and scaling the data, which is close to what the `StandardScaler` does. For some reason, the Stack Overflow post uses `Normalizer` instead of `StandardScaler`. Apparently, neither Giorgos Myrianthous nor the Kaggle author bothered to read the documentation to determine which function applies centering by the mean and scaling by the standard deviation.

Also, Giorgos Myrianthous's answer describes rescaling by the variance for some reason. That doesn't make much sense, because the variance is measured in units squared; `StandardScaler` rescales by the standard deviation, which is measured in the same units as the data. Moreover, scaling a non-constant random variable by its standard deviation gives a random variable with variance of 1. Rescaling by the variance does not do this, unless the variance is already 1.

I've demonstrated several ways that Giorgos Myrianthous's answer is misleading in https://stackoverflow.com/a/71887356/2482661.
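To illustrate the point about the standard deviation versus the variance, a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(scale=3.0, size=10_000)  # variance ~ 9

# Dividing by the standard deviation gives variance ~ 1.
print(np.var(x / x.std()))  # ~ 1.0

# Dividing by the variance does not (here it gives ~ 1/9).
print(np.var(x / x.var()))  # ~ 0.11
```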