Word Embeddings – Should Cosine or Dot Similarity Be Used Inside Word2Vec’s Neural Network

word-embeddings, word2vec

I've implemented the word2vec algorithm with its negative-sampling architecture, using a shallow neural network that performs binary classification on pairs of word-embedding vectors. The network is expected to output 1 for pairs that occur in the corpus and 0 for the randomly sampled negative pairs. In the final neuron, my implementation computes the dot product of the two vectors and passes it to a sigmoid activation; further down the line, the cross-entropy loss is computed and averaged over the batch.
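For concreteness, here is a minimal sketch of that scoring step in NumPy, assuming the setup described above; the variable names (`center_vec`, `context_vec`, `label`) are hypothetical, not taken from my actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(center_vec, context_vec, label):
    """Binary cross-entropy for one (center, context) pair.

    label is 1 for a pair observed in the corpus, 0 for a negative sample.
    """
    score = np.dot(center_vec, context_vec)  # raw dot-product score
    prob = sigmoid(score)                    # probability that the pair is "real"
    eps = 1e-12                              # guard against log(0)
    return -(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps))
```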

My question is: should I use cosine similarity instead of the dot product? I'm well aware that they differ only by the normalization of the vectors; however, I was unable to find a clear answer. Does this actually affect the quality of the embedding vectors, or does the choice of cosine vs. dot product only matter when similarity is calculated among the already trained embedding vectors?

Best Answer

If you normalize the vectors, the dot product is the same as cosine similarity. As for the general question, there is usually no single approach that you should always use. The choice is typically an empirical one: you try different options and compare the results. Sometimes, in practice, the choice is arbitrary: if the different metrics give very similar results, you just pick one.
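To illustrate the first point, a small NumPy check (illustrative only, with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)

# Cosine similarity of the raw vectors.
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Dot product of the L2-normalized vectors gives the same value.
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.isclose(cosine, np.dot(u_hat, v_hat)))  # True
```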