Word Embeddings – Should Cosine or Dot Similarity Be Used Inside Word2Vec’s Neural Network

word-embeddings, word2vec

I've implemented the word2vec algorithm with its negative-sampling architecture, using a shallow neural network that performs binary classification on pairs of word-embedding vectors. The network is expected to output 1 for pairs that occur in the corpus and 0 for the randomly sampled negative pairs. In the final neuron, my implementation computes the dot product of the two vectors and passes it to a sigmoid activation; further down the line, the cross-entropy loss is computed and averaged over the batch.
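For concreteness, here is a minimal sketch of that scoring step in NumPy, assuming the setup described above; the variable names (`center_vec`, `context_vec`, `label`) are hypothetical, not taken from my actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(center_vec, context_vec, label):
    """Binary cross-entropy for one (center, context) pair.

    label is 1 for a pair observed in the corpus, 0 for a negative sample.
    """
    score = np.dot(center_vec, context_vec)  # raw dot-product score
    prob = sigmoid(score)                    # probability that the pair is "real"
    eps = 1e-12                              # guard against log(0)
    return -(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps))
```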

My question is: should I use cosine similarity instead of the dot product? I'm well aware that they differ only by the normalization of the vectors; however, I was unable to find a clear answer. Does this actually affect the quality of the embedding vectors, or does the choice of cosine vs. dot product only matter when similarity is calculated among the already trained embedding vectors?

Best Answer

If you normalize the vectors, the dot product is the same as cosine similarity. As for the general question, there is usually no single approach that you should always use. The choice is typically an empirical one: you try different options and compare the results. Sometimes, in practice, the choice is arbitrary: if the different metrics give very similar results, you just pick one.
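To illustrate the first point, a small NumPy check (illustrative only, with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)

# Cosine similarity of the raw vectors.
cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Dot product of the L2-normalized vectors gives the same value.
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(np.isclose(cosine, np.dot(u_hat, v_hat)))  # True
```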