Solved – the function that is being optimized in word2vec

word2vec

The following question is about the Skip-gram model, but it would be a plus (though not essential) to answer it for the CBOW model as well.

Word2Vec uses neural networks, and a neural network learns by gradient descent on some objective function. So my question is:

  • How are the words inputted into a Word2Vec model? In other words, what part of the neural network is used to derive the vector representations of the words?
  • What part of the neural network are the context vectors pulled from?
  • What is the objective function which is being minimized?

Best Answer

How are the words inputted into a Word2Vec model? In other words, what part of the neural network is used to derive the vector representations of the words?

See Input vector representation vs output vector representation in word2vec
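To make the linked answer concrete, here is a minimal NumPy sketch (the vocabulary size, dimensions, and variable names are illustrative, not from the linked answer): in skip-gram, a word enters the network as a one-hot vector; multiplying it by the input weight matrix simply selects one row of that matrix, and that row is the word's vector. The context ("output") vectors live in the second weight matrix.

```python
import numpy as np

vocab_size, embed_dim = 5, 3  # toy sizes for illustration
rng = np.random.default_rng(0)

W = rng.standard_normal((vocab_size, embed_dim))      # input embeddings: one word vector per row
W_out = rng.standard_normal((embed_dim, vocab_size))  # output embeddings: one context vector per column

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

hidden = one_hot @ W     # identical to W[word_id]: the "projection" is just a row lookup
scores = hidden @ W_out  # one score per candidate context word
probs = np.exp(scores - scores.max())
probs /= probs.sum()     # softmax over the vocabulary

print(np.allclose(hidden, W[word_id]))  # → True: the word vector is a row of W
```

This is why, in practice, the one-hot multiplication is never performed explicitly; implementations index directly into the embedding matrix.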

What is the objective function which is being minimized?

The original word2vec papers are notoriously unclear on some points pertaining to the training of the neural network (see Why do so many publishing venues limit the length of paper submissions?). I advise you to look at {1-4}, which answer this question.
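For reference, in the standard formulation (paraphrased here, not quoted from {1-4}), skip-gram maximizes the average log-probability of the context words within a window of size $c$ around each centre word; the quantity minimized by gradient descent is its negative:

```latex
% Skip-gram objective: maximize the average log-probability of context
% words w_{t+j} around each centre word w_t in a corpus of length T
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}}
    \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) =
    \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}
         {\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
```

Here $v_w$ and $v'_w$ are the input (word) and output (context) vector of $w$, and $W$ is the vocabulary size. Because the softmax denominator sums over the whole vocabulary, the papers replace it in practice with hierarchical softmax or negative sampling, which is one source of the confusion the references above clear up.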


References:
