Be careful when mixing arbitrary distance functions with k-means.
K-means does not use Euclidean distance. That is a common misconception. K-means assigns points so that their contribution to the total variance is minimized, i.e. $\sum_i (x_i - \mu_i)^2$ summed over all dimensions $i$. But this sum of per-dimension contributions is exactly the squared Euclidean distance, and since $\sqrt{}$ is monotone, assigning each point to the centroid with the smallest squared Euclidean distance is the same as assigning it to the closest centroid by Euclidean distance (not computing the square roots is faster, though).
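A quick NumPy sketch (data and variable names are illustrative) showing that the argmin over squared distances and over plain Euclidean distances pick the same centroid:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # toy data points
centroids = rng.normal(size=(3, 2))  # toy cluster centers

# Squared Euclidean distance from every point to every centroid, shape (100, 3)
sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

labels_squared = sq_dists.argmin(axis=1)          # no square roots computed
labels_euclid = np.sqrt(sq_dists).argmin(axis=1)  # with square roots

assert (labels_squared == labels_euclid).all()    # identical assignments
```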
The bigger issue when mixing k-means with other distance functions is actually the mean. The k-means update step computes the mean, and the mean is the optimal estimate for minimizing the total variance (the sum of squared deviations). But that does not imply it is the best estimate for minimizing an arbitrary other distance function! (See e.g. this counter-example, where the mean is suboptimal for EMD, and this counter-example for absolute Pearson correlation.)
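A toy 1-d illustration (made-up numbers) of this point: the mean minimizes the squared-error objective that k-means uses, but the median beats it under the L1 distance.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
mean, median = x.mean(), np.median(x)  # 4.0 and 2.5

sq = lambda c: ((x - c) ** 2).sum()    # k-means' per-cluster objective
l1 = lambda c: np.abs(x - c).sum()     # L1 objective

print(sq(mean), sq(median))  # 50.0 59.0 -> the mean wins for squared error
print(l1(mean), l1(median))  # 12.0 10.0 -> the median wins for L1
```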
Usually, in situations where you would want to use a distance function other than Euclidean distance - for example because of high dimensionality or discrete data - you will not want to use k-means for the very same reasons: for example, the mean does not make much sense if you have sparse vectors or binary vectors (because it won't be binary).
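For instance, with binary data the mean immediately leaves the data domain (toy example):

```python
import numpy as np

X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0]])
print(X.mean(axis=0))  # [0.667 0.333 0.667] -- the centroid is no longer binary
```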
For other distance functions, have a look at k-medoids.
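A minimal sketch using the KMedoids estimator from the separately installed scikit-learn-extra package (toy data; Manhattan distance is chosen just as an example of a non-Euclidean metric):

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

km = KMedoids(n_clusters=3, metric="manhattan", random_state=0).fit(X)
print(km.labels_[:10])      # cluster assignments
print(km.cluster_centers_)  # centers are actual data points (medoids)
```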
The K-Means procedure - a vector quantization method often used as a clustering method - does not explicitly use pairwise distances between data points at all (in contrast to hierarchical clustering and some other methods, which allow an arbitrary proximity measure). It amounts to repeatedly assigning points to the closest centroid, using the Euclidean distance from each data point to a centroid. However, K-Means is implicitly based on pairwise Euclidean distances between data points, because the sum of squared deviations from the centroid equals the sum of pairwise squared Euclidean distances divided by the number of points: $\sum_i \|x_i - \bar{x}\|^2 = \frac{1}{n}\sum_{i<j} \|x_i - x_j\|^2$. The term "centroid" is itself from Euclidean geometry: it is the multivariate mean in Euclidean space, and Euclidean space is about Euclidean distances. Non-Euclidean distances will generally not span a Euclidean space. That's why K-Means is for Euclidean distances only.
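A quick numerical check of that identity (random toy data):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
n = len(X)

ss_dev = ((X - X.mean(axis=0)) ** 2).sum()           # squared deviations from the centroid
pairwise = pdist(X, metric="sqeuclidean").sum() / n  # pdist covers each i<j pair once

print(np.isclose(ss_dev, pairwise))  # True
```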
But a Euclidean distance between two data points can be represented in a number of alternative ways. For example, it is closely tied to the cosine or the scalar product between the points. If you have cosines, covariances, or correlations, you can always (1) transform them into (squared) Euclidean distances, then (2) create data realizing that matrix of Euclidean distances (by means of Principal Coordinates analysis or another form of metric Multidimensional Scaling), and (3) input those data to K-Means clustering. Therefore, it is possible to make K-Means "work with" pairwise cosines or the like; in fact, such implementations of K-Means clustering exist. See also this "K-means for distance matrix" implementation.
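Here is a sketch of that (1)-(2)-(3) pipeline, assuming we start from a matrix of cosine similarities between unit-normalized points (the data are made up; for unit-norm vectors, the squared Euclidean distance is $2(1 - \cos)$):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X0 = rng.normal(size=(60, 8))
X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
S = X0 @ X0.T  # cosine similarity matrix, our starting point

# (1) cosine -> squared Euclidean distance (valid for unit-norm vectors)
D2 = 2.0 * (1.0 - S)

# (2) Principal Coordinates (classical MDS): double-center, then eigendecompose
n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J
eigval, eigvec = np.linalg.eigh(B)
keep = eigval > 1e-9  # keep only positive-eigenvalue dimensions
coords = eigvec[:, keep] * np.sqrt(eigval[keep])

# (3) standard K-Means on the reconstructed coordinates
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
print(labels[:10])
```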
It is of course possible to program K-Means so that it calculates directly on the square matrix of pairwise Euclidean distances. But it will work slowly, so the more efficient way is to create data for that distance matrix (converting the distances into scalar products and so on - the path outlined in the previous paragraph) and then apply the standard K-Means procedure to that dataset.
Please note that I was discussing whether a Euclidean or non-Euclidean dissimilarity between data points is compatible with K-Means. This is related to, but not quite the same as, the question of whether non-Euclidean deviations from the centroid (in a wide sense, a centre or quasi-centroid) can be incorporated into K-Means or a modified "K-means".
See related question K-means: Why minimizing WCSS is maximizing Distance between clusters?.
Best Answer
No, KNN is generic and you can use any valid metric you want. For example, cosine distance is another metric that is used frequently. There is an implementation in scikit-learn where you can choose among several distance options; you can also define your own metric to use.

K-means is slightly different. It really uses Euclidean distance, and it becomes a harder problem for generic metrics. It has variations such as k-medians (which uses the L1 distance). k-medoids is another generalization of this algorithm, where the cluster center is chosen among the data points, so you can use any metric with it.
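A short sketch of both options in scikit-learn (toy data; the custom metric is just an arbitrary example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40)

# Built-in cosine distance
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, y)
print(knn.predict(X[:5]))

# A user-defined metric can be passed as a callable (slower, brute-force search)
knn_custom = KNeighborsClassifier(
    n_neighbors=3,
    metric=lambda a, b: np.abs(a - b).max(),  # Chebyshev distance, as an example
).fit(X, y)
print(knn_custom.predict(X[:5]))
```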