Solved – Difference between (log, square, root) transformation and Normalization

data transformation, machine learning, normalization

I am confused about the difference between transformation and normalization/standardization. My basic understanding is this. Transformation is used when the data is skewed, to make its distribution closer to Gaussian. Normalization is used to rescale the data to between 0 and 1 and make it Gaussian. What exactly differentiates transformation from normalization?

Best Answer

Normalization does not come from the word "normal" as in normal distribution. Rather, it is related to the concept of a norm in mathematics, which is made equal to 1. Compare this to orthonormality. Hence, normalization in data science usually means shifting and scaling the data so that its mean is zero and its variance is 1 (this is also called standardization). Sometimes it instead means scaling by the range rather than the standard deviation, which maps the data into the interval from 0 to 1.

This operation does not make data normal or Gaussian. It is only scaling and shifting, which preserves the shape of the distribution: a Gaussian distribution remains Gaussian, and a skewed distribution remains just as skewed. This means that if your data is Gaussian after normalization/standardization, then it was already Gaussian in the first place, and all you did was make it standard normal.
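To make this point concrete, here is a minimal sketch using only the Python standard library. The exponential sample, the seed, and the helper names `skewness` and `standardize` are my own choices for illustration, not anything from the question. It standardizes a right-skewed sample and checks that the skewness is unchanged.

```python
import random
import statistics


def skewness(xs):
    """Skewness as the mean cubed z-score (0 for a symmetric distribution)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in xs])


def standardize(xs):
    """Shift to mean 0 and scale to standard deviation 1."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


rng = random.Random(0)  # arbitrary fixed seed, only for reproducibility
skewed = [rng.expovariate(1.0) for _ in range(10_000)]  # right-skewed sample
z = standardize(skewed)

print(statistics.fmean(z), statistics.pstdev(z))  # approximately 0 and 1
print(skewness(skewed), skewness(z))              # the two values agree
```

The mean and standard deviation change, but the shape does not: the standardized sample is exactly as skewed as the original.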

What you called "transformation" is a very generic term, which includes the normalization discussed above. Some transformations will make some data normal. For instance, lognormally distributed data becomes normal after a logarithmic transformation. The Box-Cox transformation (which includes the log transform as a special case) can make some data look more normal, meaning a more symmetric, bell-shaped distribution. It is not a magic wand that will make any data Gaussian, though, and it does not always work well.
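As a rough illustration of the lognormal case, the sketch below draws a lognormal sample and compares its skewness before and after taking logs. The sample parameters, the seed, and the helper names are assumptions made for the example. `box_cox` is the usual one-parameter form for positive data, written out by hand rather than taken from any library, to show that it reduces to the log at lambda = 0.

```python
import math
import random
import statistics


def skewness(xs):
    """Skewness as the mean cubed z-score (0 for a symmetric distribution)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in xs])


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a positive value x."""
    if lam == 0:
        return math.log(x)  # the log transform is the lambda = 0 case
    return (x ** lam - 1.0) / lam


rng = random.Random(0)  # arbitrary fixed seed, only for reproducibility
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
logged = [box_cox(x, 0) for x in raw]

print(skewness(raw))     # strongly positive: long right tail
print(skewness(logged))  # close to 0: roughly symmetric
```

This works here only because the data was lognormal to begin with. On data with a different shape, the same transform may leave it skewed or skew it the other way.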

Finally, in machine learning the main reason to normalize the data is not to make it look like a normal distribution. The reason is related to the intricacies of the optimization algorithms used. It turns out that these algorithms work best when all variables (features) are on the same scale. So you "standardize" the input data in whichever way is most appropriate, such as shifting and scaling by the mean and standard deviation, or scaling by the range to get all data into the 0 to 1 interval.
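The two scaling options mentioned above could be sketched as follows, applied separately to each feature. The feature names and values are hypothetical, and `z_score` and `min_max` are illustrative helpers, not a library API.

```python
import statistics


def z_score(column):
    """Shift and scale a feature to mean 0 and standard deviation 1."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Scale a feature by its range so that it lies in the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical features measured on very different scales.
age_years = [23.0, 35.0, 47.0, 59.0]
income_usd = [28_000.0, 52_000.0, 61_000.0, 115_000.0]

for name, column in [("age", age_years), ("income", income_usd)]:
    print(name, z_score(column), min_max(column))
```

After either rescaling, both features have comparable magnitudes, which is the property the optimization algorithms care about. Neither one changes the shape of a feature's distribution.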