Solved – Why use the cosine distance for machine translation (Mikolov paper)

cosine-similarity, machine-translation, word-embeddings, word2vec

I am currently reading the paper "Exploiting Similarities among Languages for Machine Translation" by Mikolov et al. (available here: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44931.pdf), and I was wondering why they use cosine similarity to find the closest word to z (page 4, after equation (3)) instead of a more classic distance, such as the sum of squared differences of the components (i.e. squared Euclidean distance).

So my question is twofold: why this particular distance, given that the learned matrix W should act as a rotation and a scaling? And is there any record of using word embeddings with different distance metrics, and of how they compare?
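
For concreteness, here is a small NumPy sketch of the two options I have in mind; every name, shape and value below is made up for illustration and not taken from the paper's code:

```python
import numpy as np

# Toy stand-ins: W is the learned linear map from the source to the target
# embedding space, x is one source word vector, and target_vecs holds the
# target-language vocabulary embeddings (one row per word). All made up.
rng = np.random.default_rng(0)
d = 100
W = rng.normal(size=(d, d))
x = rng.normal(size=d)
target_vecs = rng.normal(size=(5000, d))

z = W @ x  # mapped query vector, as in the paper's z = Wx

# Option used in the paper: cosine similarity (direction only, lengths ignored).
cos = (target_vecs @ z) / (np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(z))
best_by_cosine = int(np.argmax(cos))

# The "classic" alternative I mention: (squared) Euclidean distance.
sq_dist = np.sum((target_vecs - z) ** 2, axis=1)
best_by_euclidean = int(np.argmin(sq_dist))

print(best_by_cosine, best_by_euclidean)  # need not agree when lengths differ
```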

Best Answer

I think it's still very much an open question which distance metric to use with word2vec when defining "similar" words. Cosine similarity is quite nice because it implicitly assumes the word vectors are normalized so that they all sit on the unit sphere, in which case the angle between any two of them is a natural distance. Moreover, the vectors of similar words tend to have comparable magnitudes, so little is lost by ignoring length, and again cosine distance becomes a natural choice.
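
To see why the angle is all that matters once the vectors are normalized: for unit-length vectors, the squared Euclidean distance equals 2 times (1 minus the cosine similarity), so both criteria pick the same nearest neighbour. A quick toy check with random vectors (nothing to do with real embeddings):

```python
import numpy as np

# Toy check: on the unit sphere, squared Euclidean distance and cosine
# similarity are monotonically related, so they rank neighbours identically.
rng = np.random.default_rng(1)
V = rng.normal(size=(1000, 50))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize rows to length 1
q = rng.normal(size=50)
q /= np.linalg.norm(q)                          # unit-length query

cos = V @ q                                     # cosine similarity
sq_dist = np.sum((V - q) ** 2, axis=1)          # squared Euclidean distance

assert np.allclose(sq_dist, 2.0 * (1.0 - cos))  # ||v - q||^2 = 2(1 - cos)
assert int(np.argmax(cos)) == int(np.argmin(sq_dist))  # same nearest neighbour
```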

In reality things are more complex, because word2vec does not explicitly require the embedding vectors to have length 1. Indeed, there is work showing that important information is hidden in the lengths of the vectors, so that the L2 distance can also be used. See here for example:

https://arxiv.org/pdf/1508.02297v1.pdf
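
As a tiny illustration of why the choice matters once lengths are allowed to differ, here is a made-up two-dimensional example (not taken from the linked paper) where cosine similarity and L2 distance pick different nearest neighbours:

```python
import numpy as np

# When embedding lengths differ, cosine similarity and L2 distance can
# disagree about which stored vector is "closest" to a query.
query = np.array([1.0, 0.0])
vecs = np.array([
    [10.0, 0.1],   # almost the same direction as the query, but much longer
    [0.9, 0.5],    # different direction, but similar length
])

cos = (vecs @ query) / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(vecs - query, axis=1)

print("cosine picks index", int(np.argmax(cos)))  # 0: only direction counts
print("L2 picks index", int(np.argmin(l2)))       # 1: length counts too
```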
