Solved – A valid distance metric for high dimensional data

clustering, machine learning, metric, similarities

I asked a question yesterday about forming a valid distance metric (Link1) and got some very good answers; however, I have a few follow-up questions about forming a proper distance metric for high-dimensional data.

  1. Why is the triangle inequality so important for a valid distance metric? Maybe this is too broad a question, but I can't think of a simple example. Could someone sketch a simple scenario, with some context, that shows why it matters?

  2. As mentioned in my previous post (Link1), I think cosine similarity is the same thing as the dot product. Am I right? If so, the dot product is not a valid distance metric, because it lacks the triangle inequality property, among others. If we transform the similarity measured by the dot product into an angular distance, will that be a proper distance metric?

  3. Regarding Euclidean distance, another post (Link2) says it is not a good metric in high dimensions. Since my data vectors live in a high-dimensional space, I am wondering which distance metrics suffer from the curse of dimensionality.

  4. Regarding point 3 above, and considering the dimensionality, would a fractional distance metric be a better choice? (Link3)
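To make points 1 and 2 above concrete, here is a small NumPy sketch (the vectors are made up for illustration): a dissimilarity built directly from the dot product, $1 - x \cdot y$, can violate the triangle inequality, while the angular distance $\arccos(\text{cosine similarity})/\pi$ satisfies it on the same points.

```python
import numpy as np

# Three unit vectors at angles 0, 45, and 90 degrees (illustrative points).
x = np.array([1.0, 0.0])
y = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])
z = np.array([0.0, 1.0])

def dot_dissim(a, b):
    """Dissimilarity derived from the dot product; for unit vectors this
    equals 1 - cos(theta)."""
    return 1.0 - a @ b

def angular_dist(a, b):
    """Angular distance: arccos of the cosine similarity, scaled to [0, 1]."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

# Triangle inequality check: is d(x, z) <= d(x, y) + d(y, z)?
print(dot_dissim(x, z), dot_dissim(x, y) + dot_dissim(y, z))
# dot-product dissimilarity: 1.0 > ~0.586, the inequality is violated
print(angular_dist(x, z), angular_dist(x, y) + angular_dist(y, z))
# angular distance: 0.5 <= 0.5, the inequality holds
```

The `np.clip` guards against floating-point values slightly outside $[-1, 1]$ before `arccos`; without it, rounding can produce NaN for nearly parallel vectors.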

Thanks very much! A

Best Answer

For high-dimensional data, shared-nearest-neighbor distances have been reported to work well; see

Houle et al., Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? Scientific and Statistical Database Management. Lecture Notes in Computer Science 6187. p. 482. doi:10.1007/978-3-642-13818-8_34
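The shared-nearest-neighbor idea can be sketched briefly: two points are considered close when their $k$-nearest-neighbor lists overlap heavily. The code below is an illustrative simplification, not the exact formulation from Houle et al.

```python
import numpy as np

def snn_dissimilarity(X, k):
    """Shared-nearest-neighbor dissimilarity (simplified sketch):
    0 when two points have identical k-NN lists, 1 when the lists
    are disjoint."""
    # Pairwise Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    # Indices of the k nearest neighbors of each point (self excluded).
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    n = len(X)
    S = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            shared = len(set(nn[i]) & set(nn[j]))
            S[i, j] = 1.0 - shared / k
    return S

# Small random example (made-up data just to exercise the function).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
S = snn_dissimilarity(X, k=5)
```

Because the dissimilarity depends on neighborhood ranks rather than raw distances, it is less sensitive to the distance concentration that plagues $L_p$ norms in high dimensions.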

Fractional distances are known not to be metrics. $L_p$ is a metric only for $p \geq 1$; you'll find this restriction in every proof of the metric properties of the Minkowski norms.
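A quick numerical counterexample (with made-up points) shows why fractional exponents break the triangle inequality:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski 'distance' with exponent p; a true metric only for p >= 1."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# With p = 0.5 the triangle inequality fails for these points:
x = np.array([0.0, 0.0])
y = np.array([1.0, 1.0])
z = np.array([1.0, 0.0])

d_xy = minkowski(x, y, 0.5)   # (1 + 1)^2 = 4
d_xz = minkowski(x, z, 0.5)   # 1
d_zy = minkowski(z, y, 0.5)   # 1
print(d_xy, d_xz + d_zy)      # 4.0 > 2.0: triangle inequality violated
```

Going through the intermediate point $z$ is "shorter" than going directly from $x$ to $y$, which is exactly what the triangle inequality forbids.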
