Solved – Gradients for skipgram word2vec

backpropagation, neural networks, self-study, word2vec

I am going through the written assignment problems from the Stanford NLP deep learning class: http://cs224d.stanford.edu/assignment1/assignment1_soln

I am trying to understand the answer to problem 3a, where they derive the gradient with respect to the center word vector.

Assume you are given a predicted word vector $v_{c}$ corresponding to the center word c for skipgram, and word prediction is made with the softmax function found in word2vec models.

$\hat{y}_{o} = p(o \mid c) = \frac{\exp(u_{o}^{T} v_{c})}{\sum_{w=1}^{W}\exp(u_{w}^{T} v_{c})}$

Where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the “output” word vectors for all words in the vocabulary. Assume the cross-entropy cost is applied to this prediction and word $o$ is the expected word.

Where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors, $\hat{y}$ is the column vector of softmax predictions over words, and $y$ is the one-hot label, also a column vector.

Where the cross-entropy cost is $CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$.
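To make sure I have the setup right, here is a small numpy sketch of how I understand the prediction and the cost (made-up shapes $W = 5$, $D = 3$; the variable names are mine, not the assignment's):

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 5, 3                      # made-up vocab size and embedding dimension
U = rng.normal(size=(D, W))      # output vectors u_1..u_W stored as columns
v_c = rng.normal(size=D)         # center word vector

scores = U.T @ v_c                              # u_w^T v_c for every word w
y_hat = np.exp(scores) / np.exp(scores).sum()   # softmax prediction, shape (W,)

o = 2                            # index of the expected word o (arbitrary)
y = np.zeros(W); y[o] = 1.0      # one-hot label
ce = -np.sum(y * np.log(y_hat))  # CE(y, y_hat) = -sum_i y_i log(y_hat_i)

print(np.isclose(ce, -np.log(y_hat[o])))  # True: CE reduces to -log(y_hat[o])
```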

The answer given for the gradient with respect to the center vector is $\frac{\partial J}{\partial v_c} = U^T(\hat{y} - y)$.

Could someone show me the steps to get to this? I have been using the question "Derivative of cross entropy loss in word2vec" as a reference, but I specifically want to understand the $U^T(\hat{y} - y)$ representation.

Best Answer

First, let's lay out what we have got and our assumptions about the shapes of different vectors. Let,

  1. $|W|$ be the number of words in the vocabulary
  2. $y$ and $\hat{y}$ be column vectors of shape $|W| \times 1$
  3. $u_i$ and $v_j$ be column vectors of shape $D \times 1$ ($D$ = dimension of the embeddings)
  4. $y$ be the one-hot encoded column vector of shape $|W| \times 1$
  5. $\hat{y}$ be the softmax prediction column vector of shape $|W| \times 1$
  6. $\hat{y}_i = P(i \mid c) = \frac{\exp(u_i^T v_c)}{\sum_{w=1}^W \exp(u_w^T v_c)}$
  7. Cross-entropy loss: $J = -\sum_{i=1}^W y_i \log(\hat{y}_i)$
  8. $U = [u_1, u_2, \dots, u_k, \dots, u_W]$ be a matrix composed of the $u_k$ column vectors.

Now, we can write $$J = -\sum_{i=1}^W y_i \log\left(\frac{\exp(u_i^T v_c)}{\sum_{w=1}^W \exp(u_w^T v_c)}\right)$$ Simplifying, $$J = -\sum_{i=1}^W y_i \left[u_i^T v_c - \log\left(\sum_{w=1}^W \exp(u_w^T v_c)\right)\right]$$ Now, we know that $y$ is one-hot encoded, so all its elements are zero except the one at, say, the $k^{th}$ index. This means there is only one non-zero term in the summation above, the one corresponding to $y_k$; all other terms are zero. So the cost can also be written as: $$J = -y_k\left[u_k^T v_c - \log\left(\sum_{w=1}^W \exp(u_w^T v_c)\right)\right]$$ Note: $y_k$ above is 1.
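As a quick numerical sanity check on this simplification, here is a minimal numpy sketch (made-up shapes $|W| = 5$, $D = 3$; variable names are mine, not from the assignment) comparing $-\sum_i y_i \log(\hat{y}_i)$ with $-[u_k^T v_c - \log\sum_w \exp(u_w^T v_c)]$:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 5, 3                      # made-up vocab size and embedding dimension
U = rng.normal(size=(D, W))      # columns are u_1..u_W
v_c = rng.normal(size=D)         # center word vector
k = 2                            # index where the one-hot y equals 1

scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()

J_full = -np.log(y_hat[k])                                      # -sum_i y_i log(y_hat_i)
J_simplified = -(U[:, k] @ v_c - np.log(np.exp(scores).sum()))  # -[u_k^T v_c - log sum_w exp(u_w^T v_c)]

print(np.isclose(J_full, J_simplified))  # True
```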

Solving for $\frac{\partial J}{\partial v_c}$, using $\frac{\partial}{\partial v_c} u_k^T v_c = u_k$ and $\frac{\partial}{\partial v_c} \log\left(\sum_{w=1}^W \exp(u_w^T v_c)\right) = \frac{\sum_{w=1}^W \exp(u_w^T v_c)\, u_w}{\sum_{x=1}^W \exp(u_x^T v_c)}$: $$ \frac{\partial J}{\partial v_c} = -\left[u_k - \frac{\sum_{w=1}^W \exp(u_w^T v_c)\, u_w}{\sum_{x=1}^W \exp(u_x^T v_c)}\right]$$

Which can be re-arranged as: $$\frac{\partial J}{\partial v_c} = \sum_{w=1}^W \left(\frac{\exp(u_w^T v_c)}{\sum_{x=1}^W \exp(u_x^T v_c)}\, u_w\right) - u_k$$ Using definition (6), we can rewrite the above equation as: $$\frac{\partial J}{\partial v_c} = \sum_{w=1}^W \left(\hat{y}_w\, u_w\right) - u_k$$
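This gradient is easy to check numerically. Here is a sketch (same made-up shapes and my own variable names) comparing $\sum_{w=1}^W \hat{y}_w u_w - u_k$ against centered finite differences of the cost:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 5, 3                      # made-up vocab size and embedding dimension
U = rng.normal(size=(D, W))      # columns are u_1..u_W
v_c = rng.normal(size=D)         # center word vector
k = 2                            # index of the expected word

def loss(v):
    # J = -[u_k^T v - log sum_w exp(u_w^T v)]
    scores = U.T @ v
    return -(U[:, k] @ v - np.log(np.exp(scores).sum()))

scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()
grad_analytic = sum(y_hat[w] * U[:, w] for w in range(W)) - U[:, k]

eps = 1e-6
grad_numeric = np.array([(loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
                         for e in np.eye(D)])

print(np.allclose(grad_analytic, grad_numeric))  # True
```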

Now let's see how this can be written in matrix notation. Note that:

  1. $u_k$ can be written as a matrix-vector product: $U y$. Since $y$ is one-hot with a 1 at index $k$, $U y$ simply picks out the $k^{th}$ column of $U$, which is $u_k$.
  2. And $\sum_{w=1}^W \hat{y}_w u_w$ is a linear combination of the columns $u_w$ of $U$, each scaled by the corresponding $\hat{y}_w$. This again can be written as $U\hat{y}$.

So the whole thing can be succinctly written as: $$\frac{\partial J}{\partial v_c} = U(\hat{y} - y)$$

Finally, note that we assumed the $u_i$ to be the columns of $U$. If we had instead stored them as rows (so that $U$ has shape $|W| \times D$), we would get $U^T(\hat{y} - y)$, which is the form you were looking for.
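To tie the two conventions together, here is one last sketch (made-up shapes, my own variable names) showing that with the $u_w$ as columns the gradient is $U(\hat{y} - y)$, with them as rows it is $U^T(\hat{y} - y)$, and both equal the summation form derived above:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 5, 3                        # made-up vocab size and embedding dimension
U_cols = rng.normal(size=(D, W))   # u_w stored as columns (this answer's convention)
U_rows = U_cols.T                  # u_w stored as rows (gives the U^T form from the question)
v_c = rng.normal(size=D)
k = 2
y = np.zeros(W); y[k] = 1.0        # one-hot label

scores = U_rows @ v_c              # u_w^T v_c for every w
y_hat = np.exp(scores) / np.exp(scores).sum()

sum_form = sum(y_hat[w] * U_cols[:, w] for w in range(W)) - U_cols[:, k]
print(np.allclose(sum_form, U_cols @ (y_hat - y)))    # True: column convention, U(y_hat - y)
print(np.allclose(sum_form, U_rows.T @ (y_hat - y)))  # True: row convention, U^T(y_hat - y)
```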