Solved – Log Transformation Instead of Z-Score Normalization For Machine Learning

data transformation, machine learning, normalization, numpy, python

I have almost always used scikit-learn's StandardScaler to normalize my data for machine learning. I noticed, however, that simply taking the log of the variables I wanted to normalize often resulted in better accuracy than when I used StandardScaler.
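Roughly, the two preprocessing routes I'm comparing look like this (a minimal sketch with synthetic data; in my real pipelines the features come from my own datasets):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.5, size=(1000, 3))  # skewed, strictly positive features

# Route 1: z-score normalization (what StandardScaler does)
X_zscore = StandardScaler().fit_transform(X)

# Route 2: plain log transform (log1p also handles exact zeros safely)
X_log = np.log1p(X)

print(X_zscore.mean(axis=0), X_zscore.std(axis=0))  # ~0 mean and ~1 std per column, but still skewed
print(X_log.mean(axis=0), X_log.std(axis=0))        # not unit variance, but far less skewed
```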

To give some more context, I built several binary classifiers for different purposes, both with ANNs and with XGBoost, and I noticed that log-normalizing the data always leads to better accuracy.

I'm a little puzzled by this, as nobody ever mentions log-normalization as a valid normalization technique. Everyone talks about min-max normalization and Z-score normalization (scikit-learn's StandardScaler), but no one even mentions log-normalization.

How is that possible? Am I doing something wrong?

Best Answer

It is quite common to apply a log transformation to your data if the data are always positive (e.g. the price of something) and their scale varies drastically.

A simple criterion for whether you should use a log transformation is whether you would want a linear or a log scale for the x-axis when plotting a histogram of your data.
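For example, a quick sketch of that check (synthetic lognormal data standing in for something like prices; the numbers are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prices = rng.lognormal(mean=3.0, sigma=1.2, size=5000)  # positive, spans several orders of magnitude

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram on a linear x-axis: most mass squeezed to the left, long right tail
ax_lin.hist(prices, bins=50)
ax_lin.set_title("Linear x-axis")

# Same data on a log x-axis: roughly symmetric, bell-shaped
ax_log.hist(prices, bins=np.logspace(np.log10(prices.min()), np.log10(prices.max()), 50))
ax_log.set_xscale("log")
ax_log.set_title("Log x-axis")

plt.show()
```

If the log-scaled panel is the one that looks reasonable, a log transform of that feature is a natural candidate.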

This is likely to make your ANN work better if your data indeed look that way, for one reason: recall the motivation of batch normalization - an ANN trains more easily when its inputs are close to a standard normal distribution. Z-score normalization makes your distribution zero-centered with unit variance, but that does not turn it into a normal distribution; a log transformation, on the other hand, can make the distribution look much more like a normal one. You can check whether this is true from the histogram or the kurtosis of your distribution.
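A minimal sketch of that check on synthetic data (scipy's kurtosis is the excess kurtosis, so values near 0 mean "close to normal"); note that z-scoring, being a linear transform, leaves skewness and kurtosis unchanged, while the log transform does not:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavily right-skewed feature

z = (x - x.mean()) / x.std()  # z-score normalization by hand
log_x = np.log(x)             # log transform (data are strictly positive here)

print("raw      : skew=%.2f, excess kurtosis=%.2f" % (skew(x), kurtosis(x)))
print("z-scored : skew=%.2f, excess kurtosis=%.2f" % (skew(z), kurtosis(z)))
print("log      : skew=%.2f, excess kurtosis=%.2f" % (skew(log_x), kurtosis(log_x)))
```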
