Cross Entropy – Derivation of Cross Entropy Loss in Word2Vec

machine-learning, self-study, word2vec

I am trying to work my way through the first problem set of the Stanford CS224d online course material, and I am having some issues with problem 3A: when using the skip-gram word2vec model with the softmax prediction function and the cross-entropy loss function, we want to calculate the gradients with respect to the predicted word vectors. So, given the softmax function:

$\hat{w_i} = \Pr(word_i\mid\hat{r}, w) = \frac{\exp(w_i^T \hat{r})}{\sum_{j}^{|V|}\exp(w_j^T\hat{r})}$

and the cross-entropy function:

$CE(w, \hat{w}) = -\sum\nolimits_{k} w_k \log(\hat{w_k})$

we need to calculate $\frac{\partial{CE}}{\partial{\hat{r}}}$
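
To make the notation concrete, here is a minimal NumPy sketch of these two definitions. The matrix `W`, the vector `r_hat`, and the index `i` are made-up placeholders of my own, not anything from the assignment; rows of `W` play the role of the output word vectors $w_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                        # vocabulary size |V| and embedding dimension
W = rng.normal(size=(V, d))        # row j is the output word vector w_j
r_hat = rng.normal(size=d)         # predicted vector \hat{r}

scores = W @ r_hat                             # w_j^T \hat{r} for every j
w_hat = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities \hat{w}

i = 2                              # index of the correct word
w = np.zeros(V)
w[i] = 1.0                         # one-hot target vector w
CE = -np.sum(w * np.log(w_hat))    # cross-entropy loss CE(w, \hat{w})
print(CE)
```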

My steps are as follows:

$CE(w, \hat{w}) = -\sum_{k}^{|V|} w_k \log\left(\frac{\exp(w_k^T \hat{r})}{\sum_{j}^{|V|}\exp(w_j^T\hat{r})}\right)$

$= -\sum_{k}^{|V|} \left[ w_k \log(\exp(w_k^T \hat{r})) - w_k \log\left(\sum_{j}^{|V|}\exp(w_j^T\hat{r})\right) \right]$

Now, given that $w_k$ is a one-hot vector and $i$ is the correct class:

$CE(w, \hat{w}) = -w_i^T\hat{r} + \log\left(\sum_{j}^{|V|}\exp(w_j^T\hat{r})\right)$

$\frac{\partial{CE}}{\partial{\hat{r}}} = -w_i + \frac{1}{\sum_{j}^{|V|}\exp(w_j^T\hat{r})}\sum_{j}^{|V|}\exp(w_j^T\hat{r})\,w_j$
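
As a sanity check on this expression, here is a small sketch comparing the analytic gradient with a finite-difference gradient, reusing the placeholder names `W`, `r_hat`, `d`, and `i` from the sketch above (again, these are my own illustrative values, not assignment data).

```python
def ce_loss(r):
    # CE as a function of the predicted vector: -w_i^T r + log sum_j exp(w_j^T r)
    s = W @ r
    return -W[i] @ r + np.log(np.exp(s).sum())

p = np.exp(W @ r_hat) / np.exp(W @ r_hat).sum()  # softmax probabilities at r_hat
grad_analytic = -W[i] + W.T @ p                  # -w_i + sum_j p_j * w_j

# Central finite differences along each coordinate of r_hat
eps = 1e-6
grad_numeric = np.array([
    (ce_loss(r_hat + eps * e) - ce_loss(r_hat - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # expect True
```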

Is this correct, or could it be simplified further? I want to make sure I am on the right track, since the problem set solutions aren't posted online. Also, getting the written assignments correct is important for being able to properly do the programming assignments.

Best Answer

Your result
$$\frac{\partial{CE}}{\partial{\hat{r}}} = -w_i + \frac{1}{\sum_{j}^{|V|}\exp(w_j^T\hat{r})}\sum_{j}^{|V|}\exp(w_j^T\hat{r})\,w_j$$
can be rewritten as
$$\frac{\partial{CE}}{\partial{\hat{r}}} = -w_i + \sum_{j}^{|V|} \left( \frac{ \exp(w_j^T\hat{r}) }{\sum_{j}^{|V|}\exp(w_j^T\hat{r})} \cdot w_j \right).$$
Note that both sums are indexed by $j$, but they really should be two different variables. This would be more appropriate:
$$\frac{\partial{CE}}{\partial{\hat{r}}} = -w_i + \sum_{x}^{|V|} \left( \frac{ \exp(w_x^T\hat{r}) }{\sum_{j}^{|V|}\exp(w_j^T\hat{r})} \cdot w_x \right),$$
which translates to
$$\frac{\partial{CE}}{\partial{\hat{r}}} = -w_i + \sum_{x}^{|V|} \Pr(word_x\mid\hat{r}, w) \cdot w_x.$$
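
In matrix form this is the familiar "prediction minus target" pattern pushed through the output word vectors. A one-line sketch, reusing the illustrative NumPy names from the question (`W`, `p` as the softmax probabilities, `w` as the one-hot target, `i` as the correct index):

```python
# sum_x Pr(word_x | r_hat, w) * w_x  -  w_i, written as W^T (p - w)
grad = W.T @ (p - w)
print(np.allclose(grad, -W[i] + W.T @ p))   # expect True
```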
