Solved – What are the use cases of different distance metrics in cluster analysis

clustering, distance, distance-functions, k-means, metric

I'm trying to use different distance metrics such as Euclidean, Manhattan, cosine, and Chebyshev (among others) in my k-means algorithm to calculate distances between the data points and the centers. In what situations would one distance metric be more useful than another in a clustering scenario? [Comparing all the above-mentioned distance metrics]

Best Answer

Be careful when mixing arbitrary distance functions with k-means.

K-means does not use Euclidean distance; that is a common misconception. K-means assigns each point so that its variance contribution, $\sum_i (x_i - \mu_i)^2$ over all dimensions $i$, is minimized. But that sum is exactly the squared Euclidean distance, and since $\sqrt{}$ is monotone, you can just as well assign each point to its closest center by Euclidean distance (not computing the square roots is faster, though).
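A minimal NumPy sketch of this equivalence (the data and centers are made up): the assignment step gives identical labels whether you use squared Euclidean distance or take the square root first.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # made-up data: 200 points in 5 dimensions
centers = rng.normal(size=(4, 5))    # 4 made-up cluster centers

# Squared Euclidean distance from every point to every center, shape (200, 4).
sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Assignment by squared Euclidean distance ...
labels_squared = sq_dists.argmin(axis=1)
# ... is identical to assignment by Euclidean distance, because sqrt is monotone;
# computing the square roots is simply wasted work.
labels_euclidean = np.sqrt(sq_dists).argmin(axis=1)

assert (labels_squared == labels_euclidean).all()
```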

The bigger issue when mixing k-means with other distance functions is actually the mean. The way k-means updates the mean works for variance: the mean is the best estimate for minimizing the total variance. But that does not imply it is also the best estimate for minimizing an arbitrary other distance function! (See e.g. the counter-example where the mean is suboptimal for EMD, and the counter-example for absolute Pearson correlation.)
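A tiny numeric sketch of this point (the sample values are made up): for the squared-error objective the mean is the optimal center, but for an L1/Manhattan objective the median beats the mean.

```python
import numpy as np

x = np.array([0.0, 0.0, 0.0, 10.0])    # made-up 1-d sample with an outlier
mean, median = x.mean(), np.median(x)   # 2.5 and 0.0

def total_sq(c):  return ((x - c) ** 2).sum()   # k-means objective (variance)
def total_abs(c): return np.abs(x - c).sum()    # Manhattan / L1 objective

# The mean minimizes the sum of squared deviations ...
print(total_sq(mean), "<=", total_sq(median))   # 75.0 <= 100.0
# ... but for the L1 objective the median is the better center:
print(total_abs(median), "<", total_abs(mean))  # 10.0 < 15.0
```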

Usually, in situations where you would want to use a different distance function than Euclidean distance - for example because of high dimensionality or discrete data - you will not want to use k-means for the very same reasons: the mean does not make much sense if you have sparse vectors or binary vectors (for binary data, the mean won't be binary).
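For instance, with two made-up sparse binary vectors:

```python
import numpy as np

# Two made-up sparse binary vectors (e.g. feature-presence indicators).
a = np.array([1, 0, 0, 0, 1, 0])
b = np.array([0, 1, 0, 0, 1, 0])

center = np.mean([a, b], axis=0)
print(center)  # [0.5 0.5 0.  0.  1.  0. ] -- neither binary nor particularly sparse
```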

For other distance functions, have a look at k-medoids.
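To illustrate why k-medoids accepts any distance function, here is a minimal sketch (a simple Voronoi-iteration variant, not the full PAM algorithm; the function name and the data are made up): the medoid update only needs a precomputed distance matrix, never a mean.

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Sketch of k-medoids: D is an n x n distance matrix from ANY distance."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # New medoid of each cluster = the member with the smallest total
        # distance to the other members; this works for any distance,
        # unlike the mean update in k-means.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if (new_medoids == medoids).all():
            break
        medoids = new_medoids
    return medoids, labels

# Usage with a Manhattan distance matrix on made-up data:
X = np.random.default_rng(1).normal(size=(50, 3))
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
medoids, labels = k_medoids(D, k=3)
```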