Solved – Why is skip-gram better for infrequent words than CBOW

natural-language, word-embeddings, word2vec

I wonder why skip-gram works better for infrequent words than CBOW in word2vec. I read this claim on https://code.google.com/p/word2vec/.

Best Answer

Here is my oversimplified and rather naive understanding of the difference:

As we know, CBOW learns to predict a word from its context; in other words, it maximizes the probability of the target word given the surrounding words. This turns out to be a problem for rare words. For example, given the context "yesterday was really [...] day", the CBOW model will tell you that the word is most probably "beautiful" or "nice". A word like "delightful" gets much less attention from the model, because the model is designed to predict the most probable word, and rare words are smoothed over by the many examples containing more frequent words.
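To make this concrete, here is a minimal sketch (plain Python, with a hypothetical toy corpus and window size of my own choosing, not anything from the word2vec code) of how CBOW builds its training examples: each example maps a bag of context words to a single target word, so a rare word appears as the prediction target only as often as it occurs in the corpus.

```python
# Hypothetical toy corpus; "delightful" is the rare word here.
corpus = [
    "yesterday was really beautiful day",
    "yesterday was really nice day",
    "yesterday was really delightful day",
]
window = 2  # assumed context window size

def cbow_examples(sentence, window):
    """Yield (context words, target word) examples, CBOW-style."""
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield tuple(context), target

for sentence in corpus:
    for context, target in cbow_examples(sentence, window):
        print(context, "->", target)
```

Note that the context ("was", "really", "day", ...) shows up many times, but "delightful" is the target in only one example, while "beautiful" and "nice" fill the same slot in the others; the model's prediction for that context is pulled toward the frequent words.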

Skip-gram, on the other hand, is designed to predict the context from the word. Given the word "delightful", it must understand it and tell us that there is a high probability the context is "yesterday was really [...] day", or some other relevant context. With skip-gram, "delightful" does not have to compete with "beautiful" for the same prediction; instead, each delightful+context pair is treated as a new observation. Because of this, skip-gram needs more data, but in return it learns to represent even rare words.
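For comparison, here is the same toy setup sketched for skip-gram (again plain Python with an assumed corpus and window): every occurrence of the center word produces several (center word, context word) pairs, each treated as its own observation, so even a single occurrence of "delightful" gives the model several updates dedicated to that word.

```python
# Same hypothetical toy corpus as above.
corpus = [
    "yesterday was really beautiful day",
    "yesterday was really delightful day",
]
window = 2  # assumed context window size

def skipgram_pairs(sentence, window):
    """Yield (center word, context word) pairs, skip-gram-style."""
    tokens = sentence.split()
    for i, center in enumerate(tokens):
        for context in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            yield center, context

for sentence in corpus:
    for center, context in skipgram_pairs(sentence, window):
        print(center, "->", context)
```

Here "delightful" appears on the input side of its own pairs ("delightful" -> "really", "delightful" -> "day", ...), so its vector is trained directly rather than being averaged into a context that a more frequent word ends up explaining.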