Solved – Rescaling exponentially distributed variables before clustering

Tags: clustering, exponential distribution, power law, standardization

I want to cluster data that contains binary variables, exponentially distributed (power law) variables, and normally distributed variables. I'm considering preprocessing the data in the following way and wondering whether it's reasonable.

1) Shift the binary variables so that they have mean zero; no rescaling.

2) Standardize the normally distributed variables, but divide by twice the standard deviation rather than by one. This is based on Gelman, A. (2008), "Scaling regression inputs by dividing by two standard deviations," Statistics in Medicine, 27:2865–2873. [For clarification: his point is that binary variables can be interpreted directly as indicators and can be left unscaled, but for numerical inputs to be interpreted in the same way, they should be divided by twice the standard deviation; that way, numerical and binary variables have similar standard deviations (provided the binary indicators are not strongly skewed; if they are, there is not much improvement, but also no harm).]

3) Take log(x+1) [or sqrt(x)] for any exponentially distributed variable x, and standardize it in the same way as above. (A sketch of all three steps follows this list.)
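To make the three steps concrete, here is a minimal Python sketch of the proposed preprocessing. The function name and the assumption that the columns have already been split by type are mine, not part of the question.

```python
import numpy as np

def preprocess(X_bin, X_norm, X_exp):
    """Hypothetical helper: apply steps 1-3 to pre-split column blocks."""
    # 1) Shift binary variables to mean zero; no rescaling.
    X_bin = X_bin - X_bin.mean(axis=0)

    # 2) Standardize normal variables by *twice* the standard deviation
    #    (Gelman 2008), so their scale is comparable to the binary indicators.
    X_norm = (X_norm - X_norm.mean(axis=0)) / (2.0 * X_norm.std(axis=0))

    # 3) log(x + 1) for the exponentially distributed variables, then
    #    standardize in the same two-SD way.
    X_exp = np.log1p(X_exp)
    X_exp = (X_exp - X_exp.mean(axis=0)) / (2.0 * X_exp.std(axis=0))

    return np.hstack([X_bin, X_norm, X_exp])
```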

Mainly, I'm worried about (3). I'm told that it's standard practice to simply standardize exponentially distributed variables; is that because it's correct, or because it's a good-enough approximation? Also, would it be better to apply k-medoids clustering rather than k-means?

Best Answer

1.) Binary variables are a problem for many algorithms. While mapping them to two numerical values $x_0, x_1 \in \mathbb{R}$ works, the results are often "damaged" by the absence of any value in between, similar to the effects of discrete values such as integers.
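As a toy illustration of this point (my own example, not part of the answer): k-means fitted to purely binary data produces fractional centroids that no observation can actually take, which is one way the "no value in between" problem shows up.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3)).astype(float)  # purely binary data

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# The centroids are cluster means, e.g. 0.45 or 0.60 -- values strictly
# between 0 and 1 that no binary data point can equal.
print(km.cluster_centers_)
```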

2.) Standardization is a best practice when dealing with different scales. Dividing each attribute by two standard deviations instead of one will not make any fundamental difference: it just halves all your distances, so the resulting clustering is unchanged.
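A quick numerical check of this claim (again a sketch of mine, not from the answer): dividing by two standard deviations halves every pairwise Euclidean distance and leaves the k-means partition unchanged.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))

Z1 = (X - X.mean(axis=0)) / X.std(axis=0)          # divide by one SD
Z2 = (X - X.mean(axis=0)) / (2.0 * X.std(axis=0))  # divide by two SDs

print(np.allclose(pdist(Z2), 0.5 * pdist(Z1)))     # True: distances halved

labels1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z1)
labels2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z2)
print(adjusted_rand_score(labels1, labels2))       # 1.0: same partition
```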

3.) $\log(x+1)$ often works, but so does $\sqrt{x}$. Without theoretical support, either one is just a guess to make things work. Standardization is still reasonable, as discussed above.
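For intuition, here is a small comparison (illustrative only) of the two candidate transforms on simulated exponential data, using sample skewness as a rough yardstick; both pull in the long right tail substantially.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
x = rng.exponential(scale=3.0, size=10_000)

print(f"raw   skewness: {skew(x):.2f}")            # ~2 for exponential data
print(f"log1p skewness: {skew(np.log1p(x)):.2f}")  # magnitude much reduced
print(f"sqrt  skewness: {skew(np.sqrt(x)):.2f}")   # magnitude much reduced
```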

The key question you should be asking is: *how do I reasonably measure similarity?* Don't start with *how do I make k-means work on this data set?* First find out what problem you are trying to solve, then find the appropriate tools to do so. Don't let the tool define your problem.