Word2Vec Skip-Gram Model – Generating Output Vectors

deep learning, natural language, neural networks, word embeddings, word2vec

I am having problems understanding the skip-gram model of the Word2Vec algorithm.

In the continuous bag-of-words model it is easy to see how the context words "fit" into the neural network: you essentially average their representations after multiplying each one-hot encoding by the input matrix W.
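For concreteness, here is a minimal NumPy sketch of that CBOW input step; the vocabulary size, embedding size, and context indices are arbitrary placeholders:

```python
import numpy as np

V, N = 10, 4                       # vocabulary size, embedding size (placeholders)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input matrix, one row per vocabulary word

context_ids = [2, 5]               # indices of the context words (placeholders)
one_hots = np.eye(V)[context_ids]  # one-hot rows for the context words

# Multiplying each one-hot by W just selects rows of W; CBOW averages them.
h = (one_hots @ W).mean(axis=0)    # averaged hidden vector, shape (N,)
```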

However, in the case of skip-gram, you only get the input word vector by multiplying the one-hot encoding by the input matrix, and then you are supposed to get C (= window size) vector representations for the context words by multiplying that input vector representation by the output matrix W'.

What I mean is: with a vocabulary of size $V$ and embeddings of size $N$, let $W \in \mathbb{R}^{V\times N}$ be the input matrix and $W' \in \mathbb{R}^{N\times V}$ the output matrix. Given the word $w_i$ with one-hot encoding $x_i$ and context words $w_j$ and $w_h$ (with one-hot representations $x_j$ and $x_h$), multiplying $x_i$ by the input matrix $W$ gives ${\bf h} := x_i^T W = W_{(i,\cdot)} \in \mathbb{R}^N$. How do you generate $C$ score vectors from this?
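To make the dimensions concrete, here is a small NumPy sketch of the step I am describing (all sizes and indices are placeholders):

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, N))   # input matrix W
Wp = rng.normal(size=(N, V))   # output matrix W'

i = 3                          # index of the center word w_i (placeholder)
x_i = np.eye(V)[i]             # one-hot encoding of w_i

h = x_i @ W                    # equals W[i, :], shape (N,)
scores = h @ Wp                # a single score vector of shape (V,)
```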

Best Answer

I had the same problem understanding it. The output score vector turns out to be the same for all $C$ context positions; what differs is the error computed against each context word's one-hot vector. Those error vectors are then used in back-propagation to update the weights.
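A minimal NumPy sketch of that interpretation, assuming a plain softmax output layer (all sizes, indices, and the learning rate are placeholders; real implementations typically use hierarchical softmax or negative sampling instead):

```python
import numpy as np

V, N, lr = 10, 4, 0.025
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, N))          # input matrix W
Wp = rng.normal(size=(N, V))          # output matrix W'

center, context = 3, [2, 5]           # hypothetical center word and its C context words

h = W[center]                         # hidden vector (the center word's row of W)
scores = h @ Wp                       # one score vector, reused for every context position
y = np.exp(scores - scores.max())
y /= y.sum()                          # softmax probabilities, shape (V,)

# Same prediction y for all C positions, but a different error per context word.
errors = [y - np.eye(V)[c] for c in context]
total_error = np.sum(errors, axis=0)  # summed error drives the weight updates

grad_h = Wp @ total_error             # gradient w.r.t. the hidden vector (uses the old W')
Wp -= lr * np.outer(h, total_error)   # update the output matrix
W[center] -= lr * grad_h              # update the center word's input row
```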

Please correct me if I'm wrong.

Source: https://iksinc.wordpress.com/tag/skip-gram-model/