For my project, I want to group some products using a few variables. For the grouping, I am using k-means clustering. One of my variables is a metric called CR (conversion rate), which takes values ranging from 0 to some positive integer (the upper bound is not 1 because of the tracking algorithm).
After removing some extreme values, the distribution of CR looks like this (it resembles a power-law distribution to me):
Therefore, to process my data, I first took the log and then applied min-max scaling. However, to be able to take the log, I had to remove all the observations for which CR = 0. After the preprocessing, I got the second graph, where the distribution looks more like a normal distribution:
So my questions are:
- For a variable with such a distribution, what other transformation methods can be used, preferably one that doesn't require removing all the zero values (so that I don't lose information)?
- What are some reliable methods to check whether my variable follows a power-law distribution?
- Can I say that the transformed CR has a normal distribution?
- Do I actually need a log transformation for CR to use it as an input to k-means clustering? (I know that I need to apply min-max scaling because my variables are on different scales.)
Thank you!
Best Answer
R package.

P.S. Zero values are part of your data as well, unless you think they're invalid. Using a transform like $\log(1+x)$ instead of $\log(x)$ lets you keep them.
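As a rough illustration of the $\log(1+x)$ idea followed by min-max scaling, here is a minimal Python sketch using only the standard library. The function name `log1p_minmax` and the sample CR values are made up for the example, and it assumes CR is never negative.

```python
import math

def log1p_minmax(values):
    """Apply log(1 + x) to each value, then rescale the results to [0, 1]."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative inputs")
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # Constant column: nothing to rescale, so map everything to 0.0.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]

# Hypothetical CR values, including zeros that log(x) could not handle.
cr = [0.0, 0.0, 0.05, 0.2, 1.0, 3.5, 12.0]
scaled = log1p_minmax(cr)

for raw, s in zip(cr, scaled):
    print(f"CR={raw:6.2f} -> scaled={s:.3f}")
```

No rows are dropped here: the zeros map to 0 after the log step and stay in the data set, and the order of the observations is preserved because both steps are monotonic.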