Solved – Distance function for categories in K-means

clustering, k-means

How do I define a distance function when Euclidean distance doesn't apply? For instance, say my data includes nationality. I could assign a number to each nation, but two nations whose numbers are close together are no more likely to belong to the same cluster than two nations whose numbers are far apart.

Does it make sense to define a function that returns 0 if two nations are the same and some positive integer otherwise? If so, how big should that positive integer be?

Best Answer

You cannot use k-means then.

You need not only a working distance function, but also a way of computing means that is appropriate for that distance function.

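The arithmetic mean and the Euclidean distance work together. Their combination makes k-means terminate: updating the means never increases the within-cluster sum of squares, and neither does reassigning each point to its nearest mean, so the objective can only go down and the algorithm converges (to a local optimum).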
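To make the termination argument concrete, here is a minimal pure-Python sketch of the two k-means steps that records the within-cluster sum of squares after every half-step. The helper names (`sq_dist`, `wcss`, `lloyd_trace`), the toy data, and the choice of initial centers are all made up for illustration; this is not a reference implementation.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise arithmetic mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def wcss(points, labels, centers):
    """Within-cluster sum of squares: the objective k-means minimizes."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def assign(points, centers):
    """Assignment step: each point goes to its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Update step: each center moves to the mean of its points.
    An empty cluster keeps its old center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        new_centers.append(mean(members) if members else old)
    return new_centers

def lloyd_trace(points, centers, max_iter=100):
    """Alternate the two steps, recording the objective after each one."""
    labels = assign(points, centers)
    trace = [wcss(points, labels, centers)]
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        trace.append(wcss(points, labels, centers))
        new_labels = assign(points, centers)
        trace.append(wcss(points, new_labels, centers))
        if new_labels == labels:  # nothing moved, so we are done
            break
        labels = new_labels
    return trace

# Toy data (hypothetical): two Gaussian blobs in the plane.
rng = random.Random(0)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
          + [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(30)])
trace = lloyd_trace(points, centers=[points[0], points[1]])
print(trace)
```

The recorded values should never go up from one entry to the next (up to floating-point noise). That guarantee relies on the mean being the minimizer of the summed squared Euclidean distances; swap in another distance while keeping the mean and the update step loses it.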

However, what would be the mean of "american, canadian, canadian, chilean, chinese, chinese, american"?
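A small sketch of why that question has no good answer. The two integer codings below are arbitrary, hypothetical choices, and the 0/1 `mismatch` function is the distance proposed in the question, not something k-means itself provides.

```python
from collections import Counter

nationalities = ["american", "canadian", "canadian", "chilean",
                 "chinese", "chinese", "american"]

def coded_mean(values, coding):
    """Arithmetic mean after mapping each category to an integer code."""
    return sum(coding[v] for v in values) / len(values)

# Two equally arbitrary (hypothetical) codings of the same categories.
coding_a = {"american": 1, "canadian": 2, "chilean": 3, "chinese": 4}
coding_b = {"chilean": 1, "american": 2, "chinese": 3, "canadian": 4}

mean_a = coded_mean(nationalities, coding_a)  # 17/7, between the codes 2 and 3
mean_b = coded_mean(nationalities, coding_b)  # 19/7, also between 2 and 3

def mismatch(a, b):
    """Distance from the question: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

def total_mismatch(center, values):
    """Sum of mismatch distances from one candidate center to all values."""
    return sum(mismatch(center, v) for v in values)

counts = Counter(nationalities)
costs = {c: total_mismatch(c, nationalities) for c in counts}
print(mean_a, mean_b)
print(costs)
```

Neither "mean" corresponds to a nationality, and the result changes with the coding. Under the 0/1 distance, the candidates with the lowest total cost are the most frequent categories (here a three-way tie), so the natural "center" is a mode rather than a mean. Replacing 1 with any other positive constant only rescales the costs and does not change which category wins, at least for a single categorical feature. Mode-based methods such as k-modes build on this idea, but that is a different algorithm from k-means.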

Sorry, but k-means is only sensible for Euclidean vector spaces, where the distance and the mean play together well. It also has other limitations, e.g. it implicitly assumes that clusters are approximately equal in size and linearly separable.