Why do lots of people want to transform skewed data into normally distributed data for machine learning applications?

Tags: data preprocessing, machine learning, normal distribution

For image and tabular data, lots of people transform the skewed data into normally distributed data during preprocessing.

What does the normal distribution mean in machine learning? Is it an essential assumption of machine learning algorithms?

Even for image data, I've seen quantile transformation used, which maps all the pixel values of an image so that they follow a normal or uniform distribution.

I can think of one reason: to reduce the influence of outliers. But these transformations distort the original distribution of the data.

Why is the normal distribution so important to machine learning that lots of preprocessing includes this step?

Best Answer

As @user2974951 says in a comment, it may be superstition that a Normal distribution is somehow better. Perhaps they have the mistaken idea that since Normal data is the result of many additive errors, if they force their data to be Normal, they can then treat the resulting numbers as having additive error. Or the first stats technique they learned was OLS regression and something about Normal was an assumption...

Normality is in general not a requirement. But whether it’s helpful depends on what the model does with the data.

For example, financial data is often lognormal -- i.e. has a multiplicative (percentage) error. Variational Autoencoders use a Normal distribution at the bottleneck to force smoothness and simplicity. Sigmoid functions work most naturally with Normal data. Mixture models often use a mixture of Normals. (If you can assume it’s Normal, you only need two parameters to completely define it, and those parameters are fairly intuitive in their meaning.)
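To make the lognormal point concrete, here is a minimal sketch using only the Python standard library. The `prices` list is simulated, hypothetical data (not real financial data), and `skewness` is a helper written for this illustration. It only shows that taking logs of lognormal values gives something roughly symmetric, which is the sense in which a multiplicative error becomes an additive one.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean of the cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "prices": lognormal draws, i.e. exp(Normal(mu=0, sigma=0.5)).
prices = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]

# Taking logs turns the multiplicative spread into an additive one.
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

The raw values should come out clearly right-skewed and the logged values close to symmetric. Whether that helps still depends on what the model does with the data.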

It could also be that we want a unimodal, symmetric distribution for our modeling and the Normal is that. (And transformations to “Normal” are often not strictly Normal, just more symmetrical.)
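As a rough illustration of "more symmetrical, not strictly Normal", the sketch below applies a square-root transform to simulated exponential data. The `waits` data and the `skewness` helper are assumptions made up for this example; the square root is just one common variance-stabilising choice, not a recommendation.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean of the cubed z-scores (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical right-skewed data: exponential waiting times with rate 1.
waits = [rng.expovariate(1.0) for _ in range(5000)]

# A square-root transform pulls in the long right tail.
sqrt_waits = [math.sqrt(w) for w in waits]

raw_skew = skewness(waits)
sqrt_skew = skewness(sqrt_waits)
print(f"skewness before transform: {raw_skew:.2f}")
print(f"skewness after sqrt:       {sqrt_skew:.2f}")
```

The transformed values are noticeably less skewed, but some positive skew remains: the result is more symmetric without being Normal.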

Normality may simplify some math for you, and it may align with your conception of the process generating your data: most of your data is in the middle with relatively rarer low or high values, which are of interest.

But my impression is that the practice is largely cargo cult in nature.