Natural Language – Why Word2Vec Maximizes Cosine Similarity Between Semantically Similar Words

natural-language, word2vec

I have an understanding of the technical details of word2vec. What I don't understand is:

Why semantically similar words should have high cosine similarity. From what I know, the goodness of a particular embedding is evaluated on shallow tasks such as word analogy. I am unable to grasp the relationship between maximizing cosine similarity and good word embeddings.

Best Answer

Why semantically similar words should have high cosine similarity.

From the Wikipedia article on distributional semantics:

The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to purport similar meanings.[1] The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth.

Why exactly cosine similarity? Because apart from being a similarity measure, which is useful in itself, it is directly related to Euclidean distance: if $\|x\| = \|y\| = 1$, then $\|x-y\|^2 = 2 - 2 \langle x, y\rangle$, because

$$\|x-y\|^2 = \langle x-y, x-y \rangle = \|x\|^2 + \|y\|^2 - 2 \langle x, y\rangle$$
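To see this identity numerically, here is a minimal NumPy sketch. The two vectors are arbitrary made-up values standing in for word embeddings, not actual word2vec output; the point is only that, after normalizing to unit length, the squared Euclidean distance equals $2 - 2\langle x, y\rangle$.

```python
import numpy as np

# Two arbitrary vectors standing in for word embeddings (made-up values).
x = np.array([0.9, 0.2, -0.4])
y = np.array([0.7, 0.1, -0.6])

# Normalize to unit length, as assumed in the identity above.
x = x / np.linalg.norm(x)
y = y / np.linalg.norm(y)

cosine = np.dot(x, y)                 # cosine similarity of unit vectors
sq_dist = np.linalg.norm(x - y) ** 2  # squared Euclidean distance

# For unit vectors: ||x - y||^2 = 2 - 2 <x, y>
print(sq_dist, 2 - 2 * cosine)        # the two values agree up to float error
assert np.isclose(sq_dist, 2 - 2 * cosine)
```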

To sum up: word2vec and other word embedding schemes tend to produce high cosine similarity for words that occur in similar contexts; that is, they map semantically similar words to vectors that are geometrically close in Euclidean space (which is really useful, since many machine learning algorithms exploit such structure).
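As a rough illustration of that last point, here is a hedged sketch using gensim (assuming gensim >= 4.0, where the parameter is called `vector_size`). The toy corpus and hyperparameters are made up for illustration, and results on such a tiny corpus will be noisy, but words that share contexts in the training data should typically end up with higher cosine similarity than unrelated words.

```python
from gensim.models import Word2Vec

# A tiny made-up corpus: "cat" and "dog" appear in very similar contexts,
# while "car" appears in a different kind of context.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
    ["the", "cat", "chased", "the", "ball"],
    ["the", "dog", "chased", "the", "ball"],
    ["the", "car", "drove", "down", "the", "road"],
    ["the", "car", "parked", "in", "the", "garage"],
] * 50  # repeat so the toy model gets enough training updates

# Hyperparameters chosen arbitrarily for a toy example (gensim >= 4.0 API).
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1,
                 epochs=50, seed=0)

# Cosine similarity between learned word vectors; "cat"/"dog" should
# typically score higher than "cat"/"car" because they share contexts.
print("cat vs dog:", model.wv.similarity("cat", "dog"))
print("cat vs car:", model.wv.similarity("cat", "car"))
print("nearest to cat:", model.wv.most_similar("cat", topn=3))
```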