I am going through the problems in the Stanford NLP deep learning class's written assignment problems http://cs224d.stanford.edu/assignment1/assignment1_soln
I am trying to understand the answer for 3a, where they are looking for the derivative with respect to the center word's vector.
Assume you are given a predicted word vector $v_{c}$ corresponding to the center word $c$ for skip-gram, and word prediction is made with the softmax function found in word2vec models.
$$\hat{y}_{o} = p(o \mid c) = \frac{\exp(u_{o}^{T} v_{c})}{\sum_{w=1}^{W}\exp(u_{w}^{T} v_{c})}$$
where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the "output" word vectors for all words in the vocabulary. Assume the cross-entropy cost is applied to this prediction and word $o$ is the expected word.
Here $U = [u_1, u_2, \cdots, u_W]$ is the matrix of all the output vectors, $\hat{y}$ is the column vector of the softmax predictions over words, and $y$ is the one-hot label, also a column vector.
Cross entropy is defined as $CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$.
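The definitions above can be made concrete with a small numerical sketch (the vocabulary size, dimensions, and expected-word index below are arbitrary, just for illustration). It shows that with a one-hot label, the cross entropy collapses to $-\log(\hat{y}_o)$:

```python
import numpy as np

# Hypothetical small vocabulary: W = 4 "output" vectors of dimension d = 3.
rng = np.random.default_rng(0)
W, d = 4, 3
U = rng.normal(size=(W, d))   # row w holds u_w (illustrative convention)
v_c = rng.normal(size=d)      # center word vector

# Softmax prediction y_hat over the vocabulary.
scores = U @ v_c                              # u_w^T v_c for every w
y_hat = np.exp(scores) / np.exp(scores).sum()

# One-hot label: the expected word o is at index 2 (arbitrary choice).
y = np.zeros(W)
y[2] = 1.0

# CE(y, y_hat) = -sum_i y_i log(y_hat_i) reduces to -log(y_hat_o).
ce = -np.sum(y * np.log(y_hat))
assert np.isclose(ce, -np.log(y_hat[2]))
```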
So the answer for the gradient for the center vector is $\frac{∂J}{∂v_c}= U^T(\hat{y} − y).$
Could someone show me the steps to get there? I have been using the question Derivative of cross entropy loss in word2vec as a reference, but I specifically want to understand the $U^T(\hat{y} - y)$ representation.
Best Answer
First, let's lay out what we have and our assumptions about the shapes of the different vectors. Assume $v_c$ and each $u_i$ are column vectors of the same dimension, $U = [u_1, u_2, \cdots, u_W]$ stacks the $u_i$ as columns, and the softmax prediction for word $w$ is $$\hat{y}_w = \frac{\exp(u_w^T v_c)}{\sum_{x=1}^W \exp(u_x^T v_c)}$$
Now, we can write $$J = - \sum_{i=1}^W y_i \log\left(\frac{\exp(u_i^T v_c)}{\sum_{w=1}^W \exp(u_w^T v_c)}\right)$$ Simplifying, $$ J = - \sum_{i=1}^W y_i\left[u_i^T v_c - \log\left(\sum_{w=1}^W \exp(u_w^T v_c)\right)\right] $$ Now, we know that $y$ is one-hot encoded, so all its elements are zero except the one at, say, the $k$-th index. That means there is only one non-zero term in the summation above, the one corresponding to $y_k$; all other terms are zero. So the cost can also be written as: $$J = -y_k\left[u_k^T v_c - \log\left(\sum_{w=1}^W \exp(u_w^T v_c)\right)\right]$$ where $y_k = 1$.
Solving for $\frac{\partial J}{\partial v_c}$: the first term differentiates to $u_k$, and the log-sum-exp term contributes the softmax-weighted sum of the $u_w$ by the chain rule, giving $$ \frac{\partial J}{\partial v_c} = -\left[u_k - \frac{\sum_{w=1}^W \exp(u_w^T v_c)\, u_w}{\sum_{x=1}^W \exp(u_x^T v_c)}\right]$$
This can be re-arranged as: $$\frac{\partial J}{\partial v_c} = \sum_{w=1}^W \left(\frac{\exp(u_w^T v_c)}{\sum_{x=1}^W \exp(u_x^T v_c)}\, u_w\right) - u_k$$ Using the softmax definition of $\hat{y}_w$, we can rewrite this as: $$\frac{\partial J}{\partial v_c} = \sum_{w=1}^W \hat{y}_w u_w - u_k$$
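As a sanity check of this re-arranged form, a finite-difference sketch (with arbitrary random shapes and an arbitrary label index $k$) confirms that $\sum_w \hat{y}_w u_w - u_k$ matches the numerical gradient of $J$:

```python
import numpy as np

rng = np.random.default_rng(1)
W, d = 5, 3
U = rng.normal(size=(W, d))   # row w holds u_w (illustrative shapes)
v_c = rng.normal(size=d)
k = 1                         # index of the expected word (arbitrary)

def J(v):
    # J = -[u_k^T v - log sum_w exp(u_w^T v)]
    s = U @ v
    return -(s[k] - np.log(np.exp(s).sum()))

# Analytic gradient from the re-arranged form: sum_w y_hat_w u_w - u_k.
s = U @ v_c
y_hat = np.exp(s) / np.exp(s).sum()
grad = (y_hat[:, None] * U).sum(axis=0) - U[k]

# Central finite differences along each coordinate of v_c.
eps = 1e-6
num = np.array([(J(v_c + eps * e) - J(v_c - eps * e)) / (2 * eps)
                for e in np.eye(d)])
assert np.allclose(grad, num, atol=1e-5)
```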
Now let's see how this can be written in matrix notation. Note that $\sum_{w=1}^W \hat{y}_w u_w = U\hat{y}$ (a matrix-vector product, with the $u_w$ as the columns of $U$), and that $u_k = Uy$, since $y$ is one-hot at index $k$.
So the whole thing can be succinctly written as: $$\frac{\partial J}{\partial v_c} = U[\hat{y} - y]$$
Finally, note that we assumed the $u_i$ to be column vectors. If we had instead started with the $u_i$ as the rows of $U$, we would get $U^T[\hat{y} - y]$, the same expression you were looking for.