High Dimensional Data – Why Euclidean Distance Is Not a Good Metric

clustering, distance-functions, high-dimensional, machine-learning, metric

I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Also, what counts as 'high dimensions'? I have been applying hierarchical clustering using Euclidean distance with 100 features. Up to how many features is it 'safe' to use this metric?

Best Answer

A great summary of non-intuitive results in higher dimensions comes from "A Few Useful Things to Know about Machine Learning" by Pedro Domingos at the University of Washington:

[O]ur intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

The article is also full of many additional pearls of wisdom for machine learning.
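These counterintuitive facts are easy to check numerically. Below is a minimal NumPy sketch (mine, not from the article) that estimates by Monte Carlo how much of a hypercube's volume falls inside the inscribed hypersphere, and how far typical multivariate Gaussian samples lie from the mean, as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

for d in (2, 10, 50, 100):
    # Fraction of the hypercube [-1, 1]^d lying inside the inscribed
    # unit hypersphere (Monte Carlo estimate) -- it collapses toward 0.
    cube = rng.uniform(-1.0, 1.0, size=(n, d))
    frac_inside = np.mean(np.linalg.norm(cube, axis=1) <= 1.0)

    # Distance of standard multivariate Gaussian samples from the mean:
    # the mass concentrates in a thin shell around radius ~ sqrt(d).
    gauss = rng.standard_normal(size=(n, d))
    radii = np.linalg.norm(gauss, axis=1)

    print(f"d={d:3d}  cube volume inside sphere ~ {frac_inside:.4f}  "
          f"Gaussian radius ~ {radii.mean():.2f} +/- {radii.std():.2f}")
```

Already at a few dozen dimensions the inscribed sphere holds essentially none of the cube's volume, and the Gaussian radii cluster tightly around $\sqrt{d}$ rather than near the mean.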

Another application, beyond machine learning, is nearest neighbor search: given an observation of interest, find its nearest neighbors (in the sense that these are the points with the smallest distance from the query point). But in high dimensions a curious phenomenon arises: the ratio between the distances to the nearest and the farthest points approaches 1, i.e. the points essentially become uniformly distant from each other. This phenomenon can be observed for a wide variety of distance metrics, but it is more pronounced for the Euclidean metric than for, say, the Manhattan distance metric. The premise of nearest neighbor search is that "closer" points are more relevant than "farther" points, but if all points are essentially uniformly distant from each other, the distinction is meaningless.
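This concentration of distances is also easy to demonstrate empirically. The following sketch (an illustrative setup with uniform data in the unit hypercube, not taken from either paper) draws random points and reports the ratio of the nearest to the farthest Euclidean distance from a query point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for d in (2, 10, 100, 1000):
    # Uniform points in the unit hypercube; the first point acts as the
    # query and the rest as its candidate neighbors.
    x = rng.uniform(size=(n_points, d))
    dists = np.linalg.norm(x[1:] - x[0], axis=1)  # Euclidean (L2) distances
    print(f"d={d:5d}  nearest/farthest distance ratio ~ {dists.min() / dists.max():.3f}")
```

As $d$ grows the ratio climbs toward 1, i.e. the "nearest" neighbor is barely nearer than the farthest point.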

From Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space":

It has been argued in [Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, "When Is 'Nearest Neighbor' Meaningful?"] that under certain reasonable assumptions on the data distribution, the ratio of the distances of the nearest and farthest neighbors to a given target in high dimensional space is almost 1 for a wide variety of data distributions and distance functions. In such a case, the nearest neighbor problem becomes ill defined, since the contrast between the distances to different data points does not exist. In such cases, even the concept of proximity may not be meaningful from a qualitative perspective: a problem which is even more fundamental than the performance degradation of high dimensional algorithms.

... Many high-dimensional indexing structures and algorithms use the [E]uclidean distance metric as a natural extension of its traditional use in two- or three-dimensional spatial applications. ... In this paper we provide some surprising theoretical and experimental results in analyzing the dependency of the $L_k$ norm on the value of $k$. More specifically, we show that the relative contrasts of the distances to a query point depend heavily on the $L_k$ metric used. This provides considerable evidence that the meaningfulness of the $L_k$ norm worsens faster with increasing dimensionality for higher values of $k$. Thus, for a given problem with a fixed (high) value for the dimensionality $d$, it may be preferable to use lower values of $k$. This means that the $L_1$ distance metric (Manhattan distance metric) is the most preferable for high dimensional applications, followed by the Euclidean metric ($L_2$). ...

The authors of the "Surprising Behavior" paper then propose using $L_k$ norms with $k<1$. They present results demonstrating that these "fractional norms" increase the contrast between the farthest and nearest points. However, later research has argued against fractional norms; see "Fractional norms and quasinorms do not help to overcome the curse of dimensionality" by Mirkes, Allohibi, & Gorban (2020). (Thanks to michen00 for the comment and helpful citation.)
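For intuition, here is a small simulation in the spirit of the paper's relative-contrast measure; the helper `relative_contrast` and the uniform-data setup are my own illustration, not the authors' exact experiment. It shows the contrast $(D_{\max} - D_{\min})/D_{\min}$ shrinking with dimension, and shrinking faster for larger $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_trials = 1000, 20

def relative_contrast(points, query, k):
    """(D_max - D_min) / D_min under the L_k norm (a quasinorm for k < 1)."""
    d = np.sum(np.abs(points - query) ** k, axis=1) ** (1.0 / k)
    return (d.max() - d.min()) / d.min()

for dim in (2, 20, 200):
    for k in (0.5, 1, 2, 4):
        c = np.mean([relative_contrast(rng.uniform(size=(n_points, dim)),
                                       rng.uniform(size=dim), k)
                     for _ in range(n_trials)])
        print(f"dim={dim:4d}  k={k:>3}  mean relative contrast ~ {c:6.2f}")
    print()
```

Note that a larger contrast for fractional $k$ in this toy setting does not by itself justify fractional norms in practice, which is essentially the point made by Mirkes, Allohibi, & Gorban.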
