Solved – Exact details of how word2vec (Skip-gram and CBOW) generate input word pairs

natural language, neural networks, word2vec

I am trying to reimplement skip-gram and CBOW. I think I understand the architecture well, but I am confused about how the input pairs are exactly generated.

For skip-gram, based on McCormick's post, a sentence like "I like soup very much" would yield four training pairs, assuming "soup" is the center word and the window size is 2:

soup, I
soup, like
soup, very
soup, much
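
To make this concrete, here is a minimal sketch of what I mean by fixed-window pair generation (the function name and tokenization are my own, not from McCormick's post):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs with a fixed window size."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context words lie up to `window` positions to the left and right.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("I like soup very much".split())
print([p for p in pairs if p[0] == "soup"])
# [('soup', 'I'), ('soup', 'like'), ('soup', 'very'), ('soup', 'much')]
```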

Then I read an actual implementation from Stanford cs20i. It introduces randomness: instead of always using the specified window size ($w$), it chooses a value between 1 and $w$. Effectively, I think this downsamples the training data, but can anyone explain why this is necessary? I thought one of the biggest selling points of word2vec is leveraging as much data as possible with a computationally efficient model, so why downsample?
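
For reference, the randomized variant I am asking about looks roughly like this (a sketch of the idea only, not the actual cs20i code):

```python
import random

def skipgram_pairs_dynamic(tokens, max_window=2):
    """Like the fixed-window version, but the effective window for each
    center word is drawn uniformly from 1..max_window."""
    pairs = []
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)  # the randomness in question
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```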

For CBOW, I am confused reading different sources. At least one version seems to average the input words within the context and then predict the center word. For example, using the same sentence and window size as above, there would be only one training example corresponding to the center word:

(I + like + very + much) / 4, soup
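
In code, my reading of that averaging version would be something like the following (the vocabulary, embedding matrix, and variable names are made up for illustration):

```python
import numpy as np

vocab = {"I": 0, "like": 1, "soup": 2, "very": 3, "much": 4}
emb_dim = 8
E = np.random.randn(len(vocab), emb_dim)  # embedding matrix, one row per word

context = ["I", "like", "very", "much"]
target = "soup"

# Average the context word vectors to form a single hidden representation,
# which would then be fed to the output layer to predict `target`.
h = np.mean([E[vocab[w]] for w in context], axis=0)
```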

I have two questions regarding CBOW:

First, is this how it is done in the word2vec paper, Efficient Estimation of Word Representations in Vector Space? I found the paper somewhat confusing on this point.

Second, can we just do the exact opposite of the skip-gram way of generating word pairs? Then we would get

I, soup
like, soup 
very, soup 
much, soup 

which would give more training examples, wouldn't it?

As I understand it, all input/output word pairs are represented as one-hot vectors.

Best Answer

I did more research and found answers:

1. Why is downsampling used in the skip-gram case?

Quoting from the paper:

We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.

So by choosing a random window size within the specified maximum, the implementation effectively gives less weight to more distant words: with a maximum window $w$, a context word at distance $d$ is included only when the sampled window is at least $d$, i.e. with probability $(w - d + 1)/w$.
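
A quick simulation confirms this weighting (my own illustration, not code from the paper or the cs20i implementation):

```python
import random
from collections import Counter

max_window = 5
trials = 100_000
counts = Counter()

for _ in range(trials):
    window = random.randint(1, max_window)   # dynamic window, as in skip-gram
    for d in range(1, max_window + 1):
        if d <= window:                      # distance-d word included this round
            counts[d] += 1

for d in range(1, max_window + 1):
    print(d, round(counts[d] / trials, 3), (max_window - d + 1) / max_window)
# Distance 1 is always included; distance 5 only about 1/5 of the time.
```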

2. The way CBOW treats input and output.

Suppose the training example is [(I, like, very, much), soup]; the input one-hot vectors are summed rather than averaged. See the diagram from the CBOW paper.

[Figure: CBOW architecture diagram from the paper]

Also, it does not matter whether you sum the one-hot vectors first and then multiply by the embedding matrix, or multiply first and then sum; the two are mathematically equivalent.
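
A small numpy check of that equivalence (vocabulary size and dimensions are arbitrary):

```python
import numpy as np

vocab_size, emb_dim = 5, 8
E = np.random.randn(vocab_size, emb_dim)    # embedding matrix
context_ids = [0, 1, 3, 4]                  # indices of I, like, very, much

# Option A: sum the one-hot vectors first, then multiply by E.
one_hots = np.eye(vocab_size)[context_ids]  # shape (4, vocab_size)
h_a = one_hots.sum(axis=0) @ E

# Option B: look up (multiply) each word's row first, then sum.
h_b = E[context_ids].sum(axis=0)

assert np.allclose(h_a, h_b)                # mathematically equivalent
```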

3. Can we just do the exact opposite of the Skip-gram way of generating word pairs?

No, because that is not how CBOW works, as described above. However, it might be interesting to try it and see how the results differ; in a way, such a scheme would be more similar to the Skip-gram model than to CBOW.

For more details on CBOW, see the answer in another thread, Tensorflow: Word2vec CBOW model.
