Solved – Using k-means with other metrics

clustering, k-means, metric

So I realize this has been asked before, e.g. "What are the use cases related to cluster analysis of different distance metrics?", but I've found the answers somewhat contradictory to what the literature suggests should be possible.

Recently I read two papers that mention using the k-means algorithm with other metrics, for example edit distance between strings and the Earth Mover's Distance between distributions. The fact that these papers mention using k-means with other metrics without specifying how, particularly when it comes to computing the mean of a set of points, suggests to me that maybe there is some "standard" method of dealing with this that I'm just not picking up on.

Take for example this paper, which gives a faster implementation of the k-means algorithm. Quoting from paragraph 4 of the intro, the author says his algorithm "can be used with any black box distance metric", and in the next paragraph he mentions edit distance as a specific example. His algorithm, however, still computes the mean of a set of points and doesn't mention how this might affect results with other metrics (I'm especially perplexed as to how the mean would work with edit distance).

This other paper describes using k-means to cluster poker hands for a Texas hold 'em abstraction. If you jump to page 2, bottom of the left-hand column, the authors write "and then k-means is used to compute an abstraction with the desired number of clusters using the Earth Mover Distance between each pair of histograms as the distance metric".

I'm not really looking for someone to explain these papers to me, but am I missing some standard method for using k-means with other metrics? Standard averaging with the Earth Mover's Distance seems like it could work heuristically, but edit distance seems not to fit the mold at all. I'd appreciate any insight someone could give.

(Edit): I went ahead and tried k-means on distribution histograms using the Earth Mover's Distance (similar to what is in the poker paper), and it seemed to work fine; the clusters it output looked pretty good for my use case. For averaging I just treated the histograms as vectors and averaged them in the normal way. One thing I noticed is that the sum over all points of the distances to their means did not always decrease monotonically. In practice, though, it settled on a local minimum within 10 iterations despite the monotonicity issues. I'm going to assume this is what they did in the second paper; the only question that remains, then, is how the heck you would average when using something like edit distance.
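For concreteness, here is a minimal sketch of that procedure, not my exact code: EMD for the assignment step, plain vector averaging for the update step. It assumes the histograms are rows of an (n, d) NumPy array with equal total mass and unit-spaced bins, in which case the 1D EMD reduces to the L1 distance between cumulative sums; the function names are just illustrative.

    import numpy as np

    def emd_1d(p, q):
        # For 1D histograms with unit-spaced bins and equal total mass,
        # the EMD equals the L1 distance between cumulative sums.
        return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

    def kmeans_emd(hists, k, iters=20, seed=0):
        # Lloyd-style k-means: EMD for assignment, arithmetic mean for updates.
        rng = np.random.default_rng(seed)
        centers = hists[rng.choice(len(hists), size=k, replace=False)].astype(float)
        for _ in range(iters):
            # Assignment step: each histogram goes to its nearest center under EMD.
            dists = np.array([[emd_1d(h, c) for c in centers] for h in hists])
            labels = dists.argmin(axis=1)
            # Update step: plain vector averaging of the assigned histograms.
            for j in range(k):
                members = hists[labels == j]
                if len(members):
                    centers[j] = members.mean(axis=0)
        return labels, centers

With this update rule the total within-cluster EMD is not guaranteed to decrease every iteration, which matches the non-monotone behaviour described above.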

Best Answer

It's not as if k-means will necessarily blow up and fail if you use a different metric.

In many cases it will return some result. It is just not guaranteed to find the optimal centroids or partition with other metrics, because the arithmetic mean may not be the point that minimizes the chosen distances.

Consider the Earth Mover's Distance. Given the three vectors

3 0 0 0 0
0 0 3 0 0
0 0 0 0 3

The arithmetic mean is

1 0 1 0 1

which has EMD distances 6, 4, 6 (total 16). If the algorithm had instead used

0 0 3 0 0

the EMD distances would have been 6, 0, 6; i.e. better (total 12).

The arithmetic mean does not minimize EMD, so running k-means (with the arithmetic mean) will not yield optimal representatives.
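You can check these numbers yourself; a small sketch, again using the fact that for 1D histograms with unit bin spacing the EMD is the L1 distance between cumulative sums:

    import numpy as np

    def emd_1d(p, q):
        # 1D EMD with unit bin spacing = L1 distance between cumulative sums.
        return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

    points = np.array([[3, 0, 0, 0, 0],
                       [0, 0, 3, 0, 0],
                       [0, 0, 0, 0, 3]])

    mean = points.mean(axis=0)              # 1 0 1 0 1
    alt_center = np.array([0, 0, 3, 0, 0])  # the better representative above

    for name, c in [("arithmetic mean", mean), ("0 0 3 0 0", alt_center)]:
        d = [float(emd_1d(p, c)) for p in points]
        print(name, d, "total", sum(d))
    # arithmetic mean [6.0, 4.0, 6.0] total 16.0
    # 0 0 3 0 0 [6.0, 0.0, 6.0] total 12.0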

Similar things will hold for edit distances.
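To illustrate the edit-distance case (a sketch of one common workaround, not something either paper spells out): strings have no meaningful arithmetic mean at all, so one option is to replace the mean update with a medoid, i.e. the set member with the smallest total edit distance to the others, which is exactly the substitution k-medoids makes. The word list below is just made up for the example.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    words = ["kitten", "sitting", "mitten", "bitten", "fitting"]

    # No "average" string exists, but the member minimizing the total edit
    # distance to the others (a medoid) can serve as the representative.
    medoid = min(words, key=lambda w: sum(levenshtein(w, v) for v in words))
    print(medoid)  # -> 'kitten' here (ties broken by list order)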
