I'm not an expert in word2vec, but upon reading Rong, X. (2014), *word2vec Parameter Learning Explained*, and drawing on my own NN experience, I'd simplify the reasoning to this:
- Hierarchical softmax provides an improvement in training efficiency since the output probability is computed by a tree-like traversal over the output units; a given training sample only has to evaluate/update $O(\log N)$ network units, not $O(N)$. This essentially lets the weights scale to a large vocabulary - a given word is related to fewer neurons and vice versa.
- Negative sampling is a way to sample the training signal, similar in spirit to stochastic gradient descent, but the key is that you also look for negative training examples. Intuitively, it trains on a sample of places where a word might have been expected but wasn't found, which is much faster than evaluating the entire vocabulary's output on every example and makes sense for common words (a rough sketch of the cost difference follows below).
The two methods don't seem to be mutually exclusive, theoretically, but anyway that seems to be why they'd be better for frequent and infrequent words respectively.
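To make the cost difference concrete, here is a rough Python sketch of the two kinds of per-example updates (my own illustration, not the real word2vec code; the array names, shapes, and learning rate are assumptions). The full-softmax step touches all $N$ output vectors, while the negative-sampling step touches only the one target plus $k$ sampled negatives; hierarchical softmax would similarly touch only the roughly $\log_2 N$ nodes on the word's tree path.

```python
import numpy as np

# Sketch only. Assumed shapes: W_in is (N, d) input vectors, W_out is (N, d) output vectors.

def sgd_step_full_softmax(W_in, W_out, center, target, lr=0.025):
    """Full softmax: every one of the N output vectors is evaluated and updated -> O(N)."""
    h = W_in[center]                          # (d,) input vector of the center word
    scores = W_out @ h                        # (N,) dot products against all output vectors
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    probs[target] -= 1.0                      # gradient of -log softmax w.r.t. scores
    grad_in = W_out.T @ probs                 # (d,)
    W_out -= lr * np.outer(probs, h)          # updates all N output rows
    W_in[center] -= lr * grad_in

def sgd_step_negative_sampling(W_in, W_out, center, target, k=5, lr=0.025):
    """Negative sampling: only 1 positive + k sampled negatives are touched -> O(k)."""
    h = W_in[center]
    # Real word2vec draws negatives from a unigram^0.75 distribution; uniform here for brevity.
    negatives = np.random.randint(0, W_out.shape[0], size=k)
    grad_in = np.zeros_like(h)
    for w, label in [(target, 1.0)] + [(n, 0.0) for n in negatives]:
        sigma = 1.0 / (1.0 + np.exp(-W_out[w] @ h))
        g = sigma - label                     # gradient of the logistic loss
        grad_in += g * W_out[w]
        W_out[w] -= lr * g * h                # touches only k+1 output rows
    W_in[center] -= lr * grad_in
```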
Any code that iterates over $2k$ target words, or $2k$ context words, to create a total of $2k$ (context-word)->(target-word) pairs for training is "skip-gram". Some of the diagrams or notation in the original paper may give the impression that skip-gram uses multiple context words at once, or predicts multiple target words at once, but in fact it's always just a 1-to-1 training pair, involving pairs of words in the same (window-sized) neighborhood.
(Only CBOW, which actually sums/averages multiple context words together, truly uses a combined range of $w^{(i-k)}, \ldots, w^{(i+k)}$ words as a single NN-training example.)
If I recall correctly, the original word2vec paper described skip-gram in one way, but then at some point for CPU cache efficiency the Google-released word2vec.c code looped over the text in the opposite way – which has sometimes caused confusion for people reading that code, or other code modeled on it.
But whether you view skip-gram as predicting a central target word from individual nearby context words, or as predicting surrounding individual target words from a central context word, in the end each original text sample results in the exact same set of desired (context-word)->(target-word) predictions – just in a slightly different training order. Each ordering is reasonably called 'skip-gram' and winds up with similar results, at the end of bulk training.
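As a concrete illustration of that point (my own sketch with hypothetical function names, not the word2vec.c code), both loop orders below produce exactly the same multiset of (input-word, predicted-word) pairs from a symmetric window of radius k; only the traversal order differs:

```python
def pairs_center_predicts_context(tokens, k=2):
    """For each position, the center word is the input and each neighbor is predicted."""
    out = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                out.append((center, tokens[j]))
    return out

def pairs_context_predicts_center(tokens, k=2):
    """For each position, each neighbor is the input and the center word is predicted."""
    out = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                out.append((tokens[j], center))
    return out

text = "yesterday was really delightful day".split()
# Same multiset of (input, predicted) pairs either way; only the order differs.
assert sorted(pairs_center_predicts_context(text)) == sorted(pairs_context_predicts_center(text))
```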
Best Answer
Here is my oversimplified and rather naive understanding of the difference:
As we know, CBOW learns to predict the word from the context, or, equivalently, to maximize the probability of the target word given the context. And this happens to be a problem for rare words. For example, given the context
yesterday was really [...] day
the CBOW model will tell you that most probably the word is *beautiful* or *nice*. Words like *delightful* will get much less attention from the model, because it is designed to predict the most probable word. Rare words will be smoothed over a lot of examples with more frequent words.

On the other hand, skip-gram is designed to predict the context. Given the word *delightful*, it must understand it and tell us that there is a high probability the context is *yesterday was really [...] day*, or some other relevant context. With skip-gram, the word *delightful* will not try to compete with the word *beautiful*; instead, *delightful*+context pairs will be treated as new observations. Because of this, skip-gram will need more data, but it will learn to understand even rare words.
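To see that difference in terms of training examples, here is a tiny sketch (my own illustration; the sentence and window size are arbitrary): CBOW folds the whole context into a single averaged input that must predict the word, while skip-gram gives the rare word several dedicated pairs of its own.

```python
sentence = "yesterday was really delightful day".split()
i = sentence.index("delightful")
k = 2
context = [w for j, w in enumerate(sentence) if j != i and abs(j - i) <= k]

# CBOW: one example; 'delightful' must win the prediction made from the
# *averaged* context, where frequent words like 'nice'/'beautiful' dominate.
cbow_example = (context, "delightful")

# Skip-gram: several separate examples in which 'delightful' itself is the input,
# so its vector receives its own dedicated updates.
skipgram_examples = [("delightful", c) for c in context]

print(cbow_example)        # (['was', 'really', 'day'], 'delightful')
print(skipgram_examples)   # [('delightful', 'was'), ('delightful', 'really'), ('delightful', 'day')]
```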