Solved – Difference between Log Transformation and Standardization

Tags: descriptive-statistics, feature-scaling, k-means, machine-learning, standardization

Is there any difference between the log transformation and standardization of data before subjecting the data to a machine learning algorithm (say k-means clustering)?

It looks like a common approach in preprocessing for clustering algorithms is to first un-skew the data through a log transformation and then perform standardization. My question is: don't both of these methods achieve the same effect when it comes to un-skewing the data? That is, both seem to transform the data toward a normal distribution.

I do understand that standardization forces zero mean and unit variance, but is it really necessary to apply both of these methods to a single dataset?

So where do these two preprocessing techniques differ?

Best Answer

These two methods don't both transform the data into a normal distribution, and they are very different from each other.

  • Standardization just makes a feature zero-mean and unit-variance. For example, if the feature is uniformly distributed, it will still be uniformly distributed after standardization. It is only a linear transform, so it does not decrease the skew: skewness is the third standardized moment, and is therefore unchanged by shifting and rescaling.
  • The log transform decreases skew in some distributions, especially those with large outliers (heavy right tails). But it may not be useful if the original distribution is not skewed to begin with, and it cannot be applied in some cases (non-positive values), whereas standardization is always applicable (except when $\sigma = 0$). A small numerical illustration follows this list.

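Here is a minimal sketch of the point above, using NumPy and SciPy (my choice of tools, not part of the original answer): standardization leaves the skewness of a right-skewed sample unchanged, while the log transform removes it for lognormal-style data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed, strictly positive data

standardized = (x - x.mean()) / x.std()  # affine transform: zero mean, unit variance
logged = np.log(x)                       # only valid because x > 0 here

print(f"skewness of raw data:           {skew(x):.2f}")
print(f"skewness after standardization: {skew(standardized):.2f}")  # same as raw
print(f"skewness after log transform:   {skew(logged):.2f}")        # roughly 0 for lognormal data
```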
The aim of stacking them together might simply be to standardize all of the features after the feature-generation (here, log-transform) step, so that everything is on a comparable scale before clustering.
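A hedged sketch of that stacked preprocessing, assuming scikit-learn's StandardScaler and KMeans and a made-up two-feature dataset chosen purely for illustration: log-transform only the skewed, positive column, then standardize everything before clustering.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: one skewed positive feature, one roughly symmetric feature.
X = np.column_stack([
    rng.lognormal(size=500),                # skewed -> benefits from a log transform
    rng.normal(loc=5, scale=2, size=500),   # not skewed -> log would not help
])

X_prep = X.copy()
X_prep[:, 0] = np.log(X_prep[:, 0])              # un-skew only the positive, skewed column
X_prep = StandardScaler().fit_transform(X_prep)  # put all features on a common scale

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_prep)
print(np.bincount(labels))  # cluster sizes
```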