I had the same problem understanding it. It seems that the output score vector is the same for all C context terms, but the error against each one-hot encoded target vector is different. These error vectors are then used in back-propagation to update the weights.
Please correct me if I'm wrong.
Source: https://iksinc.wordpress.com/tag/skip-gram-model/
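To make the point concrete, here is a toy NumPy sketch (my own, with made-up sizes, not taken from the linked post): the score/probability vector is computed once from the center word, while each of the C context positions contributes its own error against its one-hot target.

```python
import numpy as np

np.random.seed(0)

V, N, C = 5, 3, 2              # vocabulary size, hidden size, number of context words (toy values)
W_in = np.random.randn(V, N)   # input -> hidden weights (word embeddings)
W_out = np.random.randn(N, V)  # hidden -> output weights

center = 2                     # index of the center word
h = W_in[center]               # hidden layer = embedding of the center word
scores = h @ W_out             # raw output scores
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

# The same probability vector is used for every one of the C context positions,
# but the error differs because each position has its own one-hot target.
contexts = [0, 4]              # indices of the C context words
errors = [probs - np.eye(V)[c] for c in contexts]

# The per-position errors are summed before back-propagating through W_out and W_in.
total_error = np.sum(errors, axis=0)
print(total_error)
```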
One possible approach could be to encode the $i$th data point as follows:
$$x_i = [n_i, w_{i1}, \dots, w_{im}]$$
$n_i$ is a numeric value and $w_{i1}, \dots, w_{im}$ are binary, corresponding to a one-hot encoding of a word from a vocabulary of size $m$. If the $i$th data point is a word, then $n_i$ is set to zero, and each $w_{ij}$ is set to 1 if the word matches the $j$th element of the vocabulary (otherwise 0). If the $i$th data point is numeric, $n_i$ is set to this value, and all $w_{ij}$ are set to 0. The numeric values should probably be normalized after encoding this way.
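A minimal sketch of that encoding (the vocabulary and helper name are hypothetical, and unknown words aren't handled):

```python
import numpy as np

def encode(token, vocab):
    """Encode a token as [n_i, w_i1, ..., w_im] per the scheme above.

    vocab is a list of m known words; a numeric token fills the numeric slot,
    a word token gets a one-hot over the vocabulary.
    """
    m = len(vocab)
    x = np.zeros(m + 1)
    try:
        x[0] = float(token)            # numeric: store the value, leave all word slots at 0
    except ValueError:
        x[1 + vocab.index(token)] = 1  # word: numeric slot stays 0, one-hot the word
    return x

vocab = ["price", "is", "dollars"]  # hypothetical vocabulary
print(encode("price", vocab))       # [0. 1. 0. 0.]
print(encode("9.99", vocab))        # [9.99 0. 0. 0.]
```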
This approach treats numerically similar values as similar to each other (e.g. the input representation of 9.99 is similar to that of 10, but different from that of 1000). This may or may not be appropriate, depending on your application. You could also imagine applying various transformations (e.g. taking the log to squash large values together).
If you're going to treat common numeric values as separate words, you could also map all non-common numeric values to the same word, representing 'uncommon numeric value'. Of course, this would make them indistinguishable.
Another possible approach would be to quantize the numeric values (possibly adaptively, i.e. with unequal bin widths). Then map all values within each bin to the same 'word'. A possible downside of this approach is that the quantization is performed a priori, so it's independent of the context in which a particular numeric value occurs. For example, 9.9 and 10 might mean very similar things in one context, but very different things in another. If they're mapped to the same word, the distinction would be lost.
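For example, a quick sketch of adaptive (quantile-based) binning, using hypothetical values:

```python
import numpy as np

# Hypothetical numeric values observed in the training data.
values = np.array([0.5, 1.0, 9.9, 10.0, 250.0, 1000.0])

# Adaptive edges: quantiles of the observed values, so each bin holds roughly equal counts.
edges = np.quantile(values, [1/3, 2/3])

def numeric_to_word(x):
    """Map a numeric value to the pseudo-word naming its bin."""
    return f"<NUM_BIN_{np.searchsorted(edges, x)}>"

print(numeric_to_word(9.9), numeric_to_word(10.0))    # same bin word here
print(numeric_to_word(9.9), numeric_to_word(1000.0))  # different bin words
```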
Best Answer
It would be unwise (how are you going to do optimization with such data?).
But you don't have to. Basically you're asking how to deal with unknown words.
One answer for that is to just use some other representation for words - instead of representing them as one-hot vectors from some vocabulary, you can use subword features (like characters or character n-grams) - you can find papers using this terminology; they're also called character-level features.
For intuition you could look into linguistic knowledge - most words aren't actually completely unrelated to other words; they're formed from more basic parts, or morphemes.
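As a rough sketch of the character n-gram idea (the boundary markers and n-gram sizes below are just assumptions, in the spirit of fastText-style subword units):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

# An unseen word still shares many n-grams with related known words,
# so its representation is not completely unrelated to theirs.
print(char_ngrams("unhappily") & char_ngrams("happily"))
```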