Machine Learning – How to Get a Valid Distance Metric for Clustering and Similarity Analysis

clustering · machine learning · metric · similarities

I need to devise a distance metric to measure the similarity of vectors. Someone suggested that I use the dot product, which seems to me the same as the cosine similarity metric; however, Wikipedia (Cosine Similarity) mentions that cosine similarity is not a proper distance metric, since it does not satisfy the triangle inequality and it violates the coincidence axiom (a proper distance metric should satisfy the four conditions (distance metric)).

My questions are:

  1. What are proper distance metrics? Please name some examples.

  2. Are Dice's coefficient and the Jaccard index proper distance metrics?

  3. Are there any disadvantages to using the dot product? (One reason for the popularity of the dot product is that it is very efficient to evaluate.)

Thanks a lot. A.

Best Answer

First of all, in many applications you do not need a distance metric; a dissimilarity will do. So make sure the triangle inequality is actually needed.

In mathematics, the triangle inequality is part of the definition of a metric, and "distance" in mathematics is synonymous with "metric". In the database literature, however, distances are often not required to be metric.
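For reference, the four conditions the question alludes to are the standard metric axioms: a function $d$ is a distance metric if, for all $x,y,z$, \begin{align*} &d(x,y)\ge 0 &&\text{(non-negativity)}\\ &d(x,y)=0 \iff x=y &&\text{(coincidence axiom)}\\ &d(x,y)=d(y,x) &&\text{(symmetry)}\\ &d(x,z)\le d(x,y)+d(y,z) &&\text{(triangle inequality)} \end{align*}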

Second, we cannot recommend a metric for your data if we don't know your data.

Third, cosine is closely related to Euclidean distance. Assuming that all your data is normalized to unit length ($\|x\|=1=\|y\|$), then \begin{align*} \text{Euclid}^2(x,y)&=\sum_i (x_i-y_i)^2\\ &=\sum_i x_i^2+\sum_i y_i^2-2\sum_i x_iy_i\\ &=1+1-2\,x\cdot y\\ &=2(1-x\cdot y) \end{align*} Therefore, if your data is normalized to unit length, $$ \sqrt{1-x\cdot y} $$ is a metric, because, as just shown, $\sqrt{1-x\cdot y}=\sqrt{\tfrac{1}{2}}\,\text{Euclid}(x,y)$.
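A quick numerical check of this identity (a minimal sketch using NumPy; the vectors and the dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, normalized to unit length (||x|| = ||y|| = 1).
x = rng.normal(size=10); x /= np.linalg.norm(x)
y = rng.normal(size=10); y /= np.linalg.norm(y)

euclid = np.linalg.norm(x - y)                 # Euclidean distance
via_dot = np.sqrt(2.0 * (1.0 - np.dot(x, y)))  # sqrt(2 (1 - x.y))

print(euclid, via_dot)  # the two values agree up to floating-point rounding
```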

While this may get you excited that there is a metric based on the dot product, recall that it only holds if all your data lies on the unit sphere, and that it is then just the Euclidean metric. If this is the behaviour you want, normalize your data and use Euclidean distance. Cosine distance is exactly this normalization: it divides by the lengths of the vectors, so they are effectively of unit length.

If your data is sparse and you can afford to keep all vector lengths in memory, this can be a faster way to compute Euclidean distance. If your data has density $s$ (i.e. a fraction $s$ of the entries are nonzero), the expected density of the dot product is $s^2$, so a good implementation can yield a substantial performance benefit of roughly $1/s$.
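A minimal sketch of this idea using SciPy sparse matrices (the matrix shape, the density, and the helper `euclid_via_dot` are illustrative assumptions, not a reference implementation):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Illustrative sparse data: 1000 vectors, 10000 dimensions, 1% nonzeros.
X = sparse_random(1000, 10_000, density=0.01, format="csr", random_state=0)

# Cache the squared norms once; this is the "keep all vector lengths
# in memory" part.
sq_norms = X.multiply(X).sum(axis=1).A.ravel()

def euclid_via_dot(i, j):
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>;
    # the dot product only touches overlapping nonzero entries.
    dot = X[i].multiply(X[j]).sum()
    # Clamp at 0 to guard against tiny negative values from rounding.
    return np.sqrt(max(sq_norms[i] + sq_norms[j] - 2.0 * dot, 0.0))

print(euclid_via_dot(0, 1))
```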

Update: it was pointed out to me that computing Euclidean distance this way can suffer from a numerical instability called "catastrophic cancellation".
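A small demonstration of the effect (a sketch; the vectors are constructed so that the true distance is about $10^{-9}$, which the dot-product route loses entirely):

```python
import numpy as np

# Two nearly identical unit vectors; the true distance is about 1e-9.
eps = 1e-9
x = np.array([1.0, 0.0])
y = np.array([np.cos(eps), np.sin(eps)])  # x rotated by a tiny angle

direct = np.linalg.norm(x - y)                 # ~1e-9, computed accurately
via_dot = np.sqrt(2.0 * (1.0 - np.dot(x, y)))  # 1 - x.y cancels to 0.0 in doubles

print(direct, via_dot)  # e.g. 1e-09 0.0
```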
