Solved – the relation of the negative sampling (NS) objective function to the original objective function in word2vec

deep learning, natural language, neural networks, word embeddings, word2vec

I was reading about the standard/famous word2vec model and, according to Stanford's notes for CS224n, the objective function changes from:

$$J_{original} = -\sum^{2m}_{j=0,j\neq m} u^\top_{c-m+j} v_c + 2m \log \left( \sum^{|V|}_{k=1} \exp(u^{\top}_k v_c) \right)$$

to:

$$J_{NS1} = -\log \sigma( u^\top_{c-m+j} v_c ) - \sum^{K}_{k=1} \log \sigma( -u^{\top}_k v_c )$$

or

$$ J_{NS2} = - \left( \log\sigma( v_{w_o}^\top v_{w_c} ) + \sum^K_{i=1} \mathbb{E}_{w_i \sim P(w)} \left[ \log \sigma( - u^\top_{w_i} v_{w_c})\right] \right)$$

I was wondering, where does the second objective function come from? Where does negative sampling come from? I don't require a rigorous proof or derivation, but any kind of justification would be nice. How does the second one approximate the first, in any sense? Is there a rough, approximate, or intuitive argument that justifies it?

Note that I understand there is a speed gain. I am more interested in understanding the thought process that might lead to the objective above while still approximately optimizing the original function, or at least still producing good word embeddings.
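
To make the comparison concrete to myself, here is a minimal NumPy sketch of the two losses for a single (center, context) pair; the embedding dimension, vocabulary size, number of negatives $K$, and all vectors are made up for illustration, and the softmax loss is written for just one context word:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, K = 50, 10_000, 5                   # embedding dim, vocab size, number of negatives (made up)

U = rng.normal(scale=0.1, size=(V, d))    # "outside"/context vectors u_k
v_c = rng.normal(scale=0.1, size=d)       # center-word vector v_c
o = 42                                    # index of the true context word (arbitrary)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Original (softmax) loss for this one context word: sums over all |V| words.
scores = U @ v_c
J_original = -scores[o] + np.log(np.sum(np.exp(scores)))

# Negative-sampling loss: the true context word plus K sampled "noise" words.
neg = rng.integers(0, V, size=K)
J_NS = -np.log(sigmoid(U[o] @ v_c)) - np.sum(np.log(sigmoid(-U[neg] @ v_c)))

print(J_original, J_NS)
```

The point of the NS version is that it touches only $K+1$ rows of $U$ instead of all $|V|$ of them, which is where the speed gain comes from.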


My own thoughts:

Let $P_{\theta}(D=1 \mid w,c)$ be the probability that a given pair $(w,c)$ of word and context came from the corpus data. Consider $-J_{NS1} = \log \sigma( u^\top_{c-m+j} v_c ) + \sum^{K}_{k=1} \log \sigma( -u^{\top}_k v_c )$ (i.e., let's view things as maximizing probabilities). Maximizing the first term does seem to produce two word vectors that are correlated: to make $-J_{NS1}$ large, one can make the first term large by driving the sigmoid toward 1, which is attained by making the inner product of the two vectors large.

However, it seems to me that the second term actually pushes us toward word representations that are bad. Let's look at what the second term is:

$$ \log \sigma( -u^\top_{\text{not context}} v_{center}) = \log \left(1 - \sigma( u^\top_{\text{not context}} v_{center}) \right)$$

We can increase the above term by making $1 - \sigma( u^\top_{\text{not context}} v_{center})$ large, which means making $\sigma( u^\top_{\text{not context}} v_{center})$ small (a "probability" close to zero). This means we want a very negative argument to the sigmoid, i.e., vectors with a large negative inner product. This seems sort of wrong to me, because if the inner product were zero, i.e. the words were perpendicular, that would seem like a better objective to aim for. Why did they choose the other one instead? Wouldn't perpendicular words be better? That is, if two words are not similar and thus not correlated, then they have nothing to do with each other and should have zero inner product.

Essentially, why is negative inner product a better sense of word similarity than inner product that is zero?
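
To see numerically how the second term behaves, here is a throwaway check of the per-negative penalty $-\log \sigma(-u^\top v)$ at a few arbitrary inner-product values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Penalty contributed by one negative sample, -log sigmoid(-u.v),
# evaluated at a few sample inner-product values:
for dot in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"u.v = {dot:+.1f}  ->  penalty = {-np.log(sigmoid(-dot)):.4f}")
```

A perpendicular pair ($u^\top v = 0$) still pays $\log 2 \approx 0.69$ per negative sample, and the penalty only vanishes as the inner product goes to $-\infty$, which is exactly what puzzles me.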

Best Answer

The answer to the question referenced by @amoeba in the comment on your question answers this quite well, but I would like to make two points.

First, to expand on a point in that answer, the objective being minimized is not the negative log of the softmax function. Rather, it is defined as a variant of noise contrastive estimation (NCE), which boils down to a set of $K+1$ logistic regressions: one for the positive sample (i.e., the true context word given the center word) and $K$ for the negative samples (i.e., false/fake context words given the center word).
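
As a rough sketch of that view (not the reference implementation; the function name, learning rate, and sample layout are my own, assuming one positive and $K$ negatives per step), each update is just a shared binary logistic regression with label 1 for the true context word and label 0 for the noise words:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_update(U, v_c, pos, negs, lr=0.025):
    """One negative-sampling step: K+1 tiny logistic regressions sharing v_c.

    U    -- (V, d) matrix of outside/context vectors (rows updated in place)
    v_c  -- (d,) center-word vector; the updated copy is returned
    pos  -- index of the true context word
    negs -- indices of the K sampled noise words
    """
    idx = np.concatenate(([pos], negs))            # the K+1 "training examples"
    labels = np.zeros(len(idx)); labels[0] = 1.0   # 1 = real pair, 0 = noise pair
    preds = sigmoid(U[idx] @ v_c)                  # predicted P(D = 1 | w, c)
    err = preds - labels                           # gradient of the logistic loss w.r.t. the scores
    grad_v = U[idx].T @ err                        # gradient w.r.t. the center vector
    U[idx] -= lr * np.outer(err, v_c)              # update the K+1 outside vectors
    return v_c - lr * grad_v                       # updated center vector
```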

Second, the reason you would want a large negative inner product between the false context words and the center word is that this implies the words are maximally dissimilar. To see this, consider the formula for cosine similarity between two vectors $x$ and $y$: $$ s_{cos}(x, y) = \frac{x^\top y}{\|x\|_2\,\|y\|_2} $$ This attains its minimum of $-1$ when $x$ and $y$ are oriented in opposite directions and equals $0$ when $x$ and $y$ are perpendicular. If they are perpendicular, they contain none of the same information, while if they are oriented oppositely, they contain opposite information.

If you imagine word vectors in 2D, this is like saying that the word "bright" has the embedding $[1\;0]$, "dark" has the embedding $[-1\;0]$, and "delicious" has the embedding $[0\;1]$. In this simple example, "bright" and "dark" are opposites. Predicting that something is "dark" when it is "bright" would be maximally incorrect, as it would convey exactly the opposite of the intended information. On the other hand, the word "delicious" carries no information about whether something is "bright" or "dark", so it is oriented perpendicularly to both.
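
Plugging the toy 2D embeddings from the paragraph above into the cosine-similarity formula (a quick throwaway check):

```python
import numpy as np

def cos_sim(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

bright    = np.array([ 1.0, 0.0])
dark      = np.array([-1.0, 0.0])
delicious = np.array([ 0.0, 1.0])

print(cos_sim(bright, dark))       # -1.0 : opposite information
print(cos_sim(bright, delicious))  #  0.0 : no shared information
```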

This is also a reason why embeddings learned from word2vec perform well at analogical reasoning, which involves sums and differences of word vectors. You can read more about the task in the word2vec paper.
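
For a toy illustration of what that analogy arithmetic looks like (the 2D vectors below are made up for illustration, not trained embeddings):

```python
import numpy as np

# Made-up 2D embeddings: axis 0 loosely encodes gender, axis 1 royalty.
emb = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

# "king" - "man" + "woman" should land near "queen".
query = emb["king"] - emb["man"] + emb["woman"]
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - query))
print(nearest)   # queen
```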
