Distance – Understanding the Curse of Dimensions and Selecting a Distance Metric

distance, metric, similarities

Somewhere I read a note that if you have many parameters $(x_1, x_2, \ldots, x_n)$ and you try to find a "similarity metric" between such vectors, you may run into the "curse of dimensionality". I believe this means that most similarity scores become equal and no longer give you any useful information. In other words, almost all pairs of vectors end up with some medium distance score, which isn't useful for categorisation, clustering, etc.

Do you know where I can learn more in detail about that?

Are there metrics which suffer less from this effect?

Best Answer

Some classic observations on distances in high-dimensional data:

  • K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, ICDT 1999: "When Is 'Nearest Neighbor' Meaningful?"
  • C. C. Aggarwal, A. Hinneburg, and D. A. Keim, ICDT 2001: "On the Surprising Behavior of Distance Metrics in High Dimensional Space"

Some more recent research on this involves shared-nearest-neighbor distances and hubness:

  • M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert and A. Zimek, SSDBM 2010: "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?"
  • T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert and A. Zimek, SSTD 2011: "Quality of Similarity Rankings in Time Series"
  • N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović. Adv. KDDM 2011: "The role of hubness in clustering high-dimensional data"
  • I don't remember the others; search for "hubness", which was their key high-dimensional observation.

These are interesting, as they point out some popular misunderstandings about the curse of dimensionality. In essence they show that the theoretical results - which assume the data to be i.i.d. - may not hold in general for data drawn from more than one distribution. The curse leads to numerical problems and to a loss of discrimination within a single distribution, while it can make it even easier to differentiate two distributions that are well separated.

Some of this should be rather obvious. Say you have objects that are $A_i\sim \mathcal{N}(0;1)$ i.i.d. in each dimension and another set of objects that are $B_i\sim \mathcal{N}(100;1)$ i.i.d. in each dimension. The distance between objects from the two different sets will always be orders of magnitude larger than any distance within a single set, and the problem actually gets easier with increasing dimensionality.
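Here is a minimal sketch of that example (assuming NumPy; the sample sizes, Euclidean distance, and the (max − min)/mean spread measure are my own illustrative choices, not something from the papers above). Within one i.i.d. Gaussian cloud the relative spread of pairwise distances shrinks as the dimensionality grows, while the gap between the two clouds stays enormous:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    # Two sets of objects: A_i ~ N(0, 1) and B_i ~ N(100, 1), i.i.d. per dimension
    A = rng.normal(0, 1, size=(100, d))
    B = rng.normal(100, 1, size=(100, d))

    # All pairwise Euclidean distances within A (a single distribution)
    within = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    within = within[np.triu_indices_from(within, k=1)]

    # Euclidean distances between objects of A and objects of B
    between = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).ravel()

    # Relative spread of within-A distances: how much discrimination is left
    spread = (within.max() - within.min()) / within.mean()
    print(f"d={d:5d}  within-A spread={spread:.3f}  "
          f"max within A={within.max():8.1f}  min between A,B={between.min():8.1f}")
```

Even for $d$ in the thousands, the smallest between-set distance dwarfs the largest within-set distance, so telling the two sets apart only gets easier; it is the discrimination *within* each set that degrades.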

I recommend reading this work by Houle et al., largely because it shows that by claiming "this data is high-dimensional, and because of the curse of dimensionality it cannot be analyzed" you might be making things a bit too easy. Still, that is a line being used all over the place: "Our algorithm only works for low-dimensional data, because of the curse of dimensionality." "Our index only works for up to 10 dimensions, because of the curse of dimensionality." Yadda yadda yadda. Many of these statements apparently just show that such authors have not understood what happens at high dimensionality in their data and algorithm (or needed an excuse). Houle et al. don't completely solve the puzzle (yet? this is fairly recent), but they at least reconsider many of the popular statements.

After all, if high dimensionality were this big a problem, how come people in text mining happily use dimensionalities on the order of 10000-100000, while in other domains people give up at just 10 dimensions?!

As for the second part of your question: cosine similarity seems to suffer less from high dimensionality. Apart from that, as long as you want to differentiate distinct distributions, control numerical precision, and do not rely on hand-chosen thresholds (which you might need to specify with lots of significant digits), the classic $L_p$ norms should still be fine.
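As a quick empirical check, here is a small sketch (again NumPy; the uniform data, sample size, and the (max − min)/mean spread measure are just illustrative assumptions) comparing how strongly Euclidean and cosine distances concentrate on a single i.i.d. distribution as the dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_spread(dists):
    """How well distances discriminate: (max - min) relative to the mean."""
    return (dists.max() - dists.min()) / dists.mean()

for d in (2, 10, 100, 1000, 10000):
    X = rng.uniform(size=(500, d))   # a single i.i.d. uniform distribution
    q = rng.uniform(size=d)          # a query point from the same distribution

    euclid = np.linalg.norm(X - q, axis=1)
    cosine = 1.0 - (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))

    print(f"d={d:6d}  Euclidean spread={relative_spread(euclid):6.3f}  "
          f"cosine spread={relative_spread(cosine):6.3f}")
```

On i.i.d. data like this, both measures typically lose contrast as $d$ grows, which already hints at the caveat below.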

However, cosine distance is also affected by the curse of dimensionality, as discussed in:

  • M. Radovanović, A. Nanopoulos, and M. Ivanović, SIGIR 2010. "On the existence of obstinate results in vector space models."