Solved – Is cosine similarity identical to l2-normalized euclidean distance

cosine distance, cosine similarity, euclidean, natural language, normalization

By "identical" I mean that it will produce identical results for a similarity ranking between a vector u and a set of vectors V.

I have a vector space model with the distance measure (euclidean distance, cosine similarity) and the normalization technique (none, l1, l2) as parameters. From my understanding, the results from the settings [cosine, none] should be identical, or at least very similar, to those from [euclidean, l2], but they aren't.

There is actually a good chance the system is still buggy, or do I have something critically wrong in my understanding of vectors?

edit: I forgot to mention that the vectors are based on word counts from documents in a corpus. Given a query document (which I also transform into a word-count vector), I want to find the document from my corpus which is most similar to it.

Just calculating their euclidean distance is a straightforward measure, but in the kind of task I work on, cosine similarity is often preferred as a similarity indicator, because vectors that differ only in length are still considered equal. The document with the smallest distance (or largest cosine similarity) is considered the most similar.
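The length-invariance point can be sketched with a toy corpus of word-count vectors (the counts below are made up for illustration): a document that repeats another document's words twice is far away in euclidean terms but identical in cosine terms.

```python
import numpy as np

# Hypothetical word-count vectors for a corpus of three documents
corpus = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],   # same direction as doc 0, twice the length
    [0.0, 3.0, 1.0],
])
query = np.array([1.0, 0.0, 0.5])

# Euclidean distance: smaller means more similar
euclid = np.linalg.norm(corpus - query, axis=1)

# Cosine similarity: larger means more similar
cosine = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

print(euclid)  # docs 0 and 1 get different distances despite pointing the same way
print(cosine)  # docs 0 and 1 both score 1.0
```

Here docs 0 and 1 are scalar multiples of the query, so cosine similarity treats all three as equal (similarity 1.0), while euclidean distance separates them by length.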

Best Answer

For $\ell^2$-normalized vectors $\mathbf{x}, \mathbf{y}$, $$||\mathbf{x}||_2 = ||\mathbf{y}||_2 = 1,$$ we have that the squared Euclidean distance is an affine function of the cosine similarity, \begin{align} ||\mathbf{x} - \mathbf{y}||_2^2 &= (\mathbf{x} - \mathbf{y})^\top (\mathbf{x} - \mathbf{y}) \\ &= \mathbf{x}^\top \mathbf{x} - 2 \mathbf{x}^\top \mathbf{y} + \mathbf{y}^\top \mathbf{y} \\ &= 2 - 2\mathbf{x}^\top \mathbf{y} \\ &= 2 - 2 \cos\angle(\mathbf{x}, \mathbf{y}) \end{align} That is, even if you normalized your data and your algorithm was invariant to scaling of the distances, you would still expect numerical differences because of the squaring. The ranking, however, is unaffected: squaring is monotone on nonnegative distances, so sorting by euclidean distance on $\ell^2$-normalized vectors produces the same order as sorting by cosine similarity (reversed).