I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high dimensions'? I have been applying hierarchical clustering using Euclidean distance with 100 features. Up to how many features is it 'safe' to use this metric?
High Dimensional Data – Why Euclidean Distance Is Not a Good Metric
Tags: clustering, distance-functions, high-dimensional, machine-learning, metric
Related Solutions
Some classic observations on distances in high-dimensional data:
- K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, ICDT 1999: "When Is 'Nearest Neighbor' Meaningful?"
- C. C. Aggarwal, A. Hinneburg, and D. A. Keim, ICDT 2001: "On the Surprising Behavior of Distance Metrics in High Dimensional Space"
Some more recent research on this involves shared-nearest-neighbor distances and hubness:
- M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert and A. Zimek, SSDBM 2010: "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?"
- T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert and A. Zimek, SSTD 2011: "Quality of Similarity Rankings in Time Series"
- N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović. Adv. KDDM 2011: "The role of hubness in clustering high-dimensional data"
- I don't remember the others; search for "hubness", which was their key high-dimensional observation.
These are interesting, as they point out some popular misunderstandings about the curse of dimensionality. In essence they show that the theoretical results - which assume the data to be i.i.d. - may not be generally true for data that has more than one distribution. The curse leads to numerical problems, and a loss of discrimination within a single distribution, while it can make it even easier to differentiate two distributions that are well separated.
Some of this should be rather obvious. Say you have objects which are $A_i\sim \mathcal{N}(0;1)$ i.i.d. in each dimension and another set of objects which are $B_i\sim \mathcal{N}(100;1)$ i.i.d. in each dimension. The distance between objects from the two different sets will always be orders of magnitude larger than any distance within a single set, and the problem actually gets easier with increasing dimensionality.
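To make the example concrete, here is a minimal simulation sketch (my own illustration, not taken from any of the papers above): it draws the two Gaussian sets described above and checks how cleanly Euclidean distance separates within-set from between-set pairs as the dimensionality grows.

```python
# Minimal sketch: two sets, A ~ N(0, 1) and B ~ N(100, 1) i.i.d. in every dimension.
# The point counts and dimensionalities below are arbitrary illustration choices.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 50  # points per set

for d in (1, 10, 100, 1000):
    A = rng.normal(0, 1, size=(n, d))
    B = rng.normal(100, 1, size=(n, d))

    within = cdist(A, A)[np.triu_indices(n, k=1)]  # distances inside set A (each pair once)
    between = cdist(A, B).ravel()                  # distances across the two sets

    # Even the closest cross-set pair is far beyond the most distant within-set
    # pair, and the margin becomes more pronounced as d grows.
    print(f"d={d:5d}  max within={within.max():9.1f}  "
          f"min between={between.min():9.1f}  "
          f"ratio={between.min() / within.max():6.1f}")
```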
I recommend reading this work by Houle et al., largely because it shows that claiming "this data is high-dimensional, and because of the curse of dimensionality it cannot be analyzed" may be making things a bit too easy for yourself. Still, that line is used all over the place: "Our algorithm works only for low-dimensional data, because of the curse of dimensionality." "Our index only works for up to 10 dimensions, because of the curse of dimensionality." Yadda yadda yadda. Many of these statements apparently just show that such authors have not understood what happens at high dimensionality in their data and algorithm (or needed an excuse). Houle et al. don't completely solve the puzzle (yet? this is fairly recent), but they at least reconsider many of the popular statements.
After all, if high dimensionality were this big a problem, how come people in text mining happily work with dimensionalities on the order of 10,000-100,000, while in other domains people give up at just 10 dimensions?!
As for the second part of your question: cosine similarity seems to suffer less from high dimensionality. Apart from that, as long as you want to differentiate well-separated distributions, keep numerical precision under control, and do not rely on hand-chosen thresholds (which you might need to specify with many significant digits), the classic $L_p$ norms should still be fine.
However, cosine similarity is also affected by the curse of dimensionality, as discussed in:
- M. Radovanović, A. Nanopoulos, and M. Ivanović, SIGIR 2010. "On the existence of obstinate results in vector space models."
For high-dimensional data, shared-nearest-neighbor distances have been reported to work well; see:
- M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek: "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?" SSDBM 2010, Lecture Notes in Computer Science 6187, p. 482. doi:10.1007/978-3-642-13818-8_34
Fractional distances are known not to be metrics: $L_p$ is a metric only for $p\geq 1$, and you will find this restriction in every proof of the metric properties of Minkowski norms. A small counterexample is sketched below.
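To make the $p<1$ case concrete, here is a small counterexample sketch (my own illustration): with $p=0.5$, the direct distance between two points can exceed the summed distances through an intermediate point, so the triangle inequality fails.

```python
# Minimal sketch: the "fractional distance" d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p)
# with p = 0.5 violates the triangle inequality, hence it is not a metric.
import numpy as np

def lp_dist(x, y, p):
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

a, b, c = [0, 0], [1, 0], [1, 1]
p = 0.5
print(lp_dist(a, c, p))                     # 4.0 (direct distance)
print(lp_dist(a, b, p) + lp_dist(b, c, p))  # 2.0 -> d(a,c) > d(a,b) + d(b,c)
```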
Best Answer
A great summary of non-intuitive results in higher dimensions can be found in "A Few Useful Things to Know about Machine Learning" by Pedro Domingos at the University of Washington.
The article is also full of many other pearls of wisdom about machine learning.
Another application, beyond machine learning, is nearest neighbor search: given an observation of interest, find its nearest neighbors (in the sense that these are the points with the smallest distance to the query point). But in high dimensions a curious phenomenon arises: the ratio between the distances to the nearest and to the farthest point approaches 1, i.e. the points essentially become uniformly distant from each other. This phenomenon can be observed for a wide variety of distance metrics, but it is more pronounced for the Euclidean metric than for, say, the Manhattan distance. The premise of nearest neighbor search is that "closer" points are more relevant than "farther" points, but if all points are essentially uniformly distant from each other, the distinction is meaningless.
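The concentration effect is easy to reproduce in a short simulation sketch (my own illustration, with arbitrary sample sizes): draw points uniformly in the unit hypercube and compare the nearest and farthest distance from a random query point as the dimensionality grows.

```python
# Minimal sketch: for uniform data, the nearest/farthest distance ratio from a
# query point drifts toward 1 as the dimensionality grows, for both the
# Euclidean (L2) and the Manhattan (L1) metric.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 1000  # database points

for d in (2, 10, 100, 1000):
    points = rng.random((n, d))
    query = rng.random((1, d))
    for metric in ("euclidean", "cityblock"):  # cityblock = Manhattan distance
        dist = cdist(query, points, metric=metric).ravel()
        print(f"d={d:5d}  {metric:9s}  nearest/farthest = {dist.min() / dist.max():.3f}")
```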
This behavior is analyzed in detail by Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim in "On the Surprising Behavior of Distance Metrics in High Dimensional Space".
The authors of the "Surprising Behavior" paper then propose using $L_k$ norms with $k<1$. They present results demonstrating that these "fractional norms" increase the contrast between the farthest and nearest points. However, later research has argued against fractional norms; see "Fractional norms and quasinorms do not help to overcome the curse of dimensionality" by Mirkes, Allohibi, & Gorban (2020). (Thanks to michen00 for the comment and helpful citation.)