Solved – Why use the cosine distance for machine translation (Mikolov paper)

cosine-similarity, machine-translation, word-embeddings, word2vec

I am currently reading the paper "Exploiting Similarities among Languages for Machine Translation" by Mikolov et al. (available here: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44931.pdf), and I was wondering why they use cosine similarity to find the closest word to z (page 4, after equation (3)) instead of a more classic distance, such as the sum of squared differences of the components (i.e. squared Euclidean distance).

So my question is twofold: why this particular distance, given that the learned matrix W should act as a rotation and a scaling? And is there any record of using word embeddings with different distance metrics, and of how they compare?
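
For concreteness, here is a small NumPy sketch of the two options I have in mind; every name, shape and value below is made up for illustration and not taken from the paper's code:

```python
import numpy as np

# Toy stand-ins: W is the learned linear map from the source to the target
# embedding space, x is one source word vector, and target_vecs holds the
# target-language vocabulary embeddings (one row per word). All made up.
rng = np.random.default_rng(0)
d = 100
W = rng.normal(size=(d, d))
x = rng.normal(size=d)
target_vecs = rng.normal(size=(5000, d))

z = W @ x  # mapped query vector, as in the paper's z = Wx

# Option used in the paper: cosine similarity (direction only, lengths ignored).
cos = (target_vecs @ z) / (np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(z))
best_by_cosine = int(np.argmax(cos))

# The "classic" alternative I mention: (squared) Euclidean distance.
sq_dist = np.sum((target_vecs - z) ** 2, axis=1)
best_by_euclidean = int(np.argmin(sq_dist))

print(best_by_cosine, best_by_euclidean)  # need not agree when lengths differ
```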

Best Answer

I think it's still very much an open question which distance metric to use with word2vec when defining "similar" words. Cosine similarity is quite nice because it implicitly assumes the word vectors are normalized so that they all sit on the unit sphere, in which case the angle between any two of them is a natural distance. Moreover, the vectors of similar words tend to have comparable magnitudes, so little is lost by ignoring length, and again cosine distance becomes a natural choice.
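
To see why the angle is all that matters once the vectors are normalized: for unit-length vectors, the squared Euclidean distance equals 2 times (1 minus the cosine similarity), so both criteria pick the same nearest neighbour. A quick toy check with random vectors (nothing to do with real embeddings):

```python
import numpy as np

# Toy check: on the unit sphere, squared Euclidean distance and cosine
# similarity are monotonically related, so they rank neighbours identically.
rng = np.random.default_rng(1)
V = rng.normal(size=(1000, 50))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize rows to length 1
q = rng.normal(size=50)
q /= np.linalg.norm(q)                          # unit-length query

cos = V @ q                                     # cosine similarity
sq_dist = np.sum((V - q) ** 2, axis=1)          # squared Euclidean distance

assert np.allclose(sq_dist, 2.0 * (1.0 - cos))  # ||v - q||^2 = 2(1 - cos)
assert int(np.argmax(cos)) == int(np.argmin(sq_dist))  # same nearest neighbour
```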

In reality things are more complex, because word2vec does not explicitly require the embedding vectors to have length 1. Indeed, there is work showing that important information is hidden in the lengths of the vectors, so that the L2 distance can also be used. See here for example:

https://arxiv.org/pdf/1508.02297v1.pdf
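
As a tiny illustration of why the choice matters once lengths are allowed to differ, here is a made-up two-dimensional example (not taken from the linked paper) where cosine similarity and L2 distance pick different nearest neighbours:

```python
import numpy as np

# When embedding lengths differ, cosine similarity and L2 distance can
# disagree about which stored vector is "closest" to a query.
query = np.array([1.0, 0.0])
vecs = np.array([
    [10.0, 0.1],   # almost the same direction as the query, but much longer
    [0.9, 0.5],    # different direction, but similar length
])

cos = (vecs @ query) / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(vecs - query, axis=1)

print("cosine picks index", int(np.argmax(cos)))  # 0: only direction counts
print("L2 picks index", int(np.argmin(l2)))       # 1: length counts too
```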
