I had the same problem understanding it. It seems that the output score vector will be the same for all C terms, but the error against each of the C one-hot encoded target vectors will be different. These error vectors are then used in back-propagation to update the weights.
Please correct me if I'm wrong.
Source: https://iksinc.wordpress.com/tag/skip-gram-model/
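For what it's worth, here is a rough NumPy sketch of that idea (one skip-gram training step in the style the linked post describes); all sizes, values, and names like W_in/W_out are made up for illustration, not taken from any real implementation:

```python
import numpy as np

V, N, C = 10, 4, 3            # toy vocab size, hidden size, context words
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))    # input->hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden->output weights
lr = 0.05

center = 2                     # index of the input (center) word
context = [1, 5, 7]            # indices of the C true context words

h = W_in[center]               # hidden layer = input word's row
u = h @ W_out                  # one score vector, shared by all C outputs
y = np.exp(u - u.max()); y /= y.sum()   # softmax

# The predicted distribution y is identical for every context position;
# only the one-hot targets differ, so the error vectors differ too.
EI = np.zeros(V)
for c in context:
    t = np.zeros(V); t[c] = 1.0
    EI += y - t                # accumulate the per-position error vectors

# Back-propagate the summed error to update both weight matrices.
W_out -= lr * np.outer(h, EI)
W_in[center] -= lr * (W_out @ EI)
```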
Any code that iterates over 2*k target words, or 2*k context words, to create a total of 2*k (context-word)->(target-word) pairs for training, is "skip-gram". Some of the diagrams or notation in the original paper may give the impression that skip-gram uses multiple context words at once, or predicts multiple target words at once, but in fact it's always just a 1-to-1 training pair, involving pairs of words in the same (window-sized) neighborhood.
(Only CBOW, which actually sums/averages multiple context words together, truly uses a combined range of words w(i-k), ..., w(i+k) as a single NN-training example.)
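For contrast, a toy sketch of just the CBOW input step (sizes, names, and indices all illustrative):

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))

context = [1, 5, 7, 8]                 # indices of w(i-k), ..., w(i+k)
h = W_in[context].mean(axis=0)         # whole window averaged into ONE vector
# h then feeds a single softmax prediction of the center word w(i),
# so the combined window really is one NN-training example.
```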
If I recall correctly, the original word2vec paper described skip-gram in one way, but then at some point for CPU cache efficiency the Google-released word2vec.c code looped over the text in the opposite way – which has sometimes caused confusion for people reading that code, or other code modeled on it.
But whether you view skip-gram as predicting a central target word from individual nearby context words, or as predicting surrounding individual target words from a central context word, in the end each original text sample results in the exact same set of desired (context-word)->(target-word) predictions – just in a slightly different training order. Each ordering is reasonably called 'skip-gram' and winds up with similar results, at the end of bulk training.
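To make that concrete, here's a toy sketch (the tiny corpus and function names are just for illustration) showing that both loop orderings generate exactly the same set of pairs:

```python
tokens = ["the", "quick", "brown", "fox", "jumps"]
k = 2  # window size

def pairs_center_as_target(tokens, k):
    # view 1: predict the central word from each individual nearby word
    out = set()
    for i in range(len(tokens)):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                out.add((tokens[j], tokens[i]))   # (context, target)
    return out

def pairs_center_as_context(tokens, k):
    # view 2: predict each individual surrounding word from the central word
    out = set()
    for i in range(len(tokens)):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                out.add((tokens[i], tokens[j]))   # (context, target)
    return out

# Same pairs either way; only the iteration order differs.
assert pairs_center_as_target(tokens, k) == pairs_center_as_context(tokens, k)
```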
Best Answer
Some basic concepts stay valid through the years :) and are used in many solutions, naturally contributing to the naming of those solutions...
N-gram is the basic concept of a (sub)sequence of consecutive words taken out of a given sequence (e.g. a sentence).
k-skip-n-gram is a generalization where 'consecutive' is dropped. It is 'just' a subsequence of the original sequence, e.g. every other word of a sentence forms a 2-skip-n-gram.
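As a small illustration, here is a sketch of one common definition (where at most k words in total may be skipped inside the subsequence); note that exact definitions of 'k' vary a bit across papers, so treat this as an assumption:

```python
from itertools import combinations

def skip_ngrams(tokens, n, k):
    # all length-n subsequences that skip at most k words in total
    grams = set()
    for idxs in combinations(range(len(tokens)), n):
        skipped = idxs[-1] - idxs[0] - (n - 1)   # words skipped inside the span
        if skipped <= k:
            grams.add(tuple(tokens[i] for i in idxs))
    return grams

sent = "the quick brown fox".split()
print(skip_ngrams(sent, 2, 0))  # k=0 reduces to plain consecutive bigrams
print(skip_ngrams(sent, 2, 1))  # 1-skip-bigrams: may skip one word
```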
word2vec is a more complicated beast; the buzzword :) here is 'embeddings'. Here is the original paper: https://arxiv.org/pdf/1301.3781.pdf. It uses the concept of consecutive words with skipping, and so 'skip' and 'gram' made their way into the name of the algorithm. BTW there are two alternative architectures used by the word2vec solution: skip-gram and CBOW.