Solved – Why does the k-means clustering algorithm use only the Euclidean distance metric

clustering, distance-functions, euclidean, k-means

Is there a specific reason, in terms of efficiency or functionality, why the k-means algorithm does not use, for example, cosine (dis)similarity as a distance measure, but can only use the Euclidean norm? In general, will the k-means method remain valid and correct when distances other than Euclidean are used?

[Addition by @ttnphns. The question is two-fold. "(Non)Euclidean distance" may concern the distance between two data points or the distance between a data point and a cluster centre. The answers so far have attempted to address both.]

Best Answer

The K-Means procedure, a vector quantization method often used for clustering, does not explicitly use pairwise distances between data points at all (in contrast to hierarchical and some other clusterings, which allow for an arbitrary proximity measure). It amounts to repeatedly assigning points to the closest centroid, thereby using the Euclidean distance from data points to a centroid. However, K-Means is implicitly based on pairwise Euclidean distances between data points, because the sum of squared deviations from the centroid is equal to the sum of pairwise squared Euclidean distances divided by the number of points. The term "centroid" is itself from Euclidean geometry: it is the multivariate mean in Euclidean space, and Euclidean space is defined by Euclidean distances. Non-Euclidean distances will generally not span a Euclidean space. That is why K-Means is for Euclidean distances only.
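
To make that equivalence concrete, here is a minimal numeric check in Python with NumPy (the random data are purely illustrative): the sum of squared deviations from the centroid agrees with the sum of pairwise squared Euclidean distances divided by the number of points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 illustrative points in 3 dimensions
n = len(X)

# Sum of squared deviations from the centroid
centroid = X.mean(axis=0)
ss_centroid = ((X - centroid) ** 2).sum()

# Sum of pairwise squared Euclidean distances over unordered pairs,
# divided by the number of points
diffs = X[:, None, :] - X[None, :, :]
d2 = (diffs ** 2).sum(axis=-1)        # full n x n matrix of squared distances
ss_pairwise = d2.sum() / (2 * n)      # each pair appears twice in the full matrix

print(ss_centroid, ss_pairwise)       # the two values agree up to rounding
```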

But a Euclidean distance between two data points can be represented in a number of alternative ways. For example, it is closely tied to the cosine or scalar product between the points. If you have cosines, covariances, or correlations, you can always (1) transform them into (squared) Euclidean distances, then (2) create data that reproduce that matrix of Euclidean distances (by means of Principal Coordinates analysis or another form of metric Multidimensional Scaling), and (3) input those data to K-Means clustering; a sketch of this pass follows below. Therefore, it is possible to make K-Means "work with" pairwise cosines or the like; in fact, such implementations of K-Means clustering exist. See also the discussion of a "K-means for distance matrix" implementation.
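
Here is a hedged sketch of that three-step pass, using NumPy and scikit-learn's `KMeans`. The cosine matrix is built from synthetic data purely for illustration; any valid cosine, covariance, or correlation matrix could take its place.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))          # synthetic data, for illustration only

# (1) Cosine similarity matrix -> squared Euclidean distances.
# For a cosine matrix C with unit diagonal, d2_ij = 2 - 2 * c_ij,
# exactly the squared distance between the L2-normalized points.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
C = Xn @ Xn.T
D2 = 2.0 - 2.0 * C

# (2) Principal Coordinates (classical MDS): double-center D2 into a
# scalar-product matrix, then eigendecompose it to recover coordinates.
n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
evals, evecs = np.linalg.eigh(B)
keep = evals > 1e-9                   # keep only the positive eigenvalues
Y = evecs[:, keep] * np.sqrt(evals[keep])

# (3) Standard K-Means on the recovered coordinates.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)
print(labels)
```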

It is possible, of course, to program K-Means so that it computes directly on the square matrix of pairwise Euclidean distances. But it will work slowly, so the more efficient way is to create data for that distance matrix (by converting the distances into scalar products and so on, the pass outlined in the previous paragraph) and then apply the standard K-Means procedure to that dataset.
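
For completeness, below is a minimal sketch of such a direct, distance-matrix-driven K-Means. The function name `kmeans_from_distance_matrix` is hypothetical, not a reference to any existing library, and the empty-cluster handling is a simplification. It also shows why the approach is slow: every assignment step touches the whole n x n matrix.

```python
import numpy as np

def kmeans_from_distance_matrix(D2, k, n_iter=100, seed=0):
    """Lloyd-style k-means driven only by a squared-distance matrix D2.

    Uses the identity  ||x_i - c_k||^2 =
        mean_j d2(i, j) - 0.5 * mean_{j,l} d2(j, l),
    with j, l ranging over the members of cluster k, so no explicit
    coordinates or centroids are ever formed.
    """
    n = len(D2)
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)              # random initial assignment
    for _ in range(n_iter):
        cost = np.empty((n, k))
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:                 # crude re-seed of an empty cluster
                members = rng.integers(n, size=1)
            within = D2[np.ix_(members, members)].mean()
            cost[:, c] = D2[:, members].mean(axis=1) - 0.5 * within
        new_labels = cost.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # converged
            break
        labels = new_labels
    return labels

# Usage with the D2 matrix from the previous sketch:
# labels = kmeans_from_distance_matrix(D2, k=3)
```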

Please note that I have been discussing whether a Euclidean or non-Euclidean dissimilarity between data points is compatible with K-Means. This is related to, but not quite the same as, the question of whether non-Euclidean deviations from a centroid (in the wide sense: a centre or quasi-centroid) can be incorporated into K-Means or a modified "K-means".

See the related question K-means: Why minimizing WCSS is maximizing Distance between clusters?.