Machine Learning – How to Scale a Power Law Distribution for K-Means Clustering

data transformation, feature-scaling, k-means, machine learning

For my project I want to group some products by using a few variables. For grouping, I am using k-means clustering. One of my variables is a metric called CR (conversion rate) which takes values ranging from 0 to some positive integer (the upper bound is not 1 due to the tracking algorithm).

After removing some extreme values, the distribution of CR looks like this (it resembles a power-law distribution to me):

Therefore, to preprocess my data, I first took the logarithm of CR and then applied min-max scaling. However, to take the logarithm I had to remove all the observations for which CR = 0. After the preprocessing, I got the second graph, where the distribution looks more like a normal distribution:

So my questions are:

For a variable with such a distribution, what other transformation methods can be used? Preferably a method that doesn't require removing all those zero values, so that I don't lose information.

  1. What are some reliable methods to check whether my variable follows a power-law distribution?
  2. Can I say that the transformed CR has a normal distribution?
  3. Do I actually need a log transformation for CR to use it as an input to k-means clustering? (I know that I need to apply a min-max scaler because my variables are on different scales.)

Thank you!

Best Answer

  1. One method is a goodness-of-fit test, as described here (Section 1.2). The linked reference is for an R package.
  2. It doesn't look Gaussian (normal). Try overlaying the normal PDF on the normalized histogram, or use a Q-Q plot, to see this more clearly. You can back up your conclusion with formal statistical tests. Besides, the logarithm of a power-law distributed variable is not normal: for a pure power law (Pareto), it is a shifted exponential. Having a normally distributed logarithm is the defining property of the log-normal distribution.
  3. K-means does not require you to apply any transformation. It may benefit from different transformations or features, but this is not set in stone, as is the case for almost all other ML algorithms.
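
To illustrate point 2, here is a minimal sketch using only the Python standard library. It draws synthetic samples (the Pareto exponent 2.5 and the log-normal parameters are arbitrary assumptions, not estimates from the CR data) and compares the skewness of their logarithms. A normal distribution has skewness 0, while an exponential distribution has skewness 2.

```python
import math
import random


def skewness(values):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000

# Synthetic stand-ins for CR; the parameters are assumptions for illustration only.
power_law = [rng.paretovariate(2.5) for _ in range(n)]        # power-law (Pareto) sample
log_normal = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]  # log-normal sample

# Take logs of both samples and compare how asymmetric the results are.
skew_log_power = skewness([math.log(x) for x in power_law])
skew_log_lognormal = skewness([math.log(x) for x in log_normal])

print(f"skewness of log(power-law sample):  {skew_log_power:.2f}")
print(f"skewness of log(log-normal sample): {skew_log_lognormal:.2f}")
```

The log of the power-law sample should stay strongly right-skewed, whereas the log of the log-normal sample should be close to symmetric. A skewness check is only a rough diagnostic; a Q-Q plot or a formal test on the real data is still worth doing.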

P.S. Zero values are part of your data too, unless you consider them invalid. Using a transform such as $\log(1+x)$ instead of $\log(x)$ lets you keep them.
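
A minimal sketch of that preprocessing, assuming the CR values are non-negative; the helper name `log1p_min_max` and the sample values are hypothetical and only serve the illustration:

```python
import math


def log1p_min_max(values):
    """Apply log(1 + x), then rescale the result to the range [0, 1]."""
    logged = [math.log1p(v) for v in values]  # log1p(0) == 0, so zeros are kept
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # All values identical: min-max scaling is undefined, so return zeros.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]


# Hypothetical CR values, including zeros that log(x) could not handle.
cr = [0.0, 0.0, 0.01, 0.05, 0.2, 1.5, 4.0]
scaled = log1p_min_max(cr)

for raw, s in zip(cr, scaled):
    print(f"CR = {raw:<5} -> scaled = {s:.3f}")
```

The zeros map to 0 and the largest value maps to 1, so no observation has to be dropped. Note that $\log(1+x) \approx x$ for small $x$, so if most CR values are far below 1 the transform compresses only the large values and leaves the bulk of the data almost unchanged.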
