First, let's lay out what we have and our assumptions about the shapes of the different vectors. Let:
- $W$ be the number of words in the vocabulary
- $u_i$ and $v_j$ be column vectors of shape $D \times 1$ ($D$ = dimension of the embeddings)
- $y$ be the one-hot encoded target, a column vector of shape $W \times 1$
- $\hat{y}$ be the softmax prediction, a column vector of shape $W \times 1$
- $\hat{y}_i = P(i|c) = \frac{exp(u_i^Tv_c)}{\sum_{w=1}^Wexp(u_w^Tv_c)}$
- Cross entropy loss: $J = -\sum_{i=1}^Wy_ilog({\hat{y_i}})$
- $U = [u_1, u_2, ..., u_k, ..., u_W]$ be the $D \times W$ matrix whose columns are the vectors $u_k$.
Now, we can write
$$J = - \sum_{i=1}^W y_i log(\frac{exp(u_i^Tv_c)}{\sum_{w=1}^Wexp(u_w^Tv_c)})$$
Simplifying,
$$ J = - \sum_{i=1}^Wy_i[u_i^Tv_c - log(\sum_{w=1}^Wexp(u_w^Tv_c))] $$
Now, we know that $y$ is one-hot encoded, so all its elements are zero except the one at, say, the $k^{th}$ index. That means there's only one non-zero term in the summation above, the one corresponding to $y_k$; all other terms are zero. So the cost can also be written as:
$$J = -y_k[u_k^Tv_c - log(\sum_{w=1}^Wexp(u_w^Tv_c))]$$
Note: $y_k$ above equals 1.
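If it helps to make this concrete, here is a small numpy sanity check (toy sizes; all variable names are just illustrative) that the full cross-entropy sum and the simplified one-hot form give the same value:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 7, 5                      # toy vocab size and embedding dimension
U = rng.normal(size=(D, W))      # columns are the vectors u_1 ... u_W
v_c = rng.normal(size=D)         # center-word vector v_c
k = 3                            # index of the observed (target) word
y = np.zeros(W); y[k] = 1.0      # one-hot target

scores = U.T @ v_c                               # u_w^T v_c for every w
y_hat = np.exp(scores) / np.exp(scores).sum()    # softmax prediction

J_full = -np.sum(y * np.log(y_hat))                     # -sum_i y_i log(y_hat_i)
J_onehot = -(scores[k] - np.log(np.exp(scores).sum()))  # -(u_k^T v_c - log sum_w exp(u_w^T v_c))
assert np.isclose(J_full, J_onehot)
```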
Solving for $\frac{\partial J}{\partial v_c}$:
$$ \frac{\partial J}{\partial v_c} = -[u_k - \frac{\sum_{w=1}^Wexp(u_w^Tv_c)u_w}{\sum_{x=1}^Wexp(u_x^Tv_c)}]$$
This can be rearranged as:
$$\frac{\partial J}{\partial v_c} = \sum_{w=1}^W (\frac{exp(u_w^Tv_c)}{\sum_{x=1}^W exp(u_x^Tv_c)}u_w) - u_k$$
Using the softmax definition of $\hat{y}_w$ from the list above, we can rewrite this as:
$$\frac{\partial J}{\partial v_c} = \sum_{w=1}^W (\hat{y}_w u_w) - u_k$$
Now let's see how this can be written in matrix notation. Note that:
- $u_k$ can be written as the matrix-vector product $Uy$
- And $\sum_{w=1}^W (\hat{y}_w u_w)$ is a linear combination of the columns $u_w$ of $U$, each scaled by the corresponding $\hat{y}_w$. This again can be written as $U\hat{y}$
So the whole thing can be succinctly written as:
$$\frac{\partial J}{\partial v_c} = U[\hat{y} - y]$$
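As a quick sanity check of the derivation (again a toy numpy sketch with illustrative names, using the column-vector convention above), the closed form $U[\hat{y} - y]$ matches a finite-difference estimate of $\frac{\partial J}{\partial v_c}$:

```python
import numpy as np

rng = np.random.default_rng(0)
W, D = 7, 5                      # toy vocab size and embedding dimension
U = rng.normal(size=(D, W))      # columns are u_1 ... u_W
v_c = rng.normal(size=D)
k = 3                            # observed word index
y = np.zeros(W); y[k] = 1.0

def loss(v):
    scores = U.T @ v
    return -(scores[k] - np.log(np.exp(scores).sum()))

scores = U.T @ v_c
y_hat = np.exp(scores) / np.exp(scores).sum()
grad_closed = U @ (y_hat - y)                    # the derived U[y_hat - y]

eps = 1e-6                                       # central finite differences
grad_numeric = np.array([(loss(v_c + eps * e) - loss(v_c - eps * e)) / (2 * eps)
                         for e in np.eye(D)])
assert np.allclose(grad_closed, grad_numeric, atol=1e-5)
```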
Finally, note that we assumed the $u_i$ to be column vectors. If we had started with row vectors, we would get $U^T[\hat{y} - y]$, the same as what you were looking for.
Best Answer
Given an input word $w_I$, Skip-gram learns the probability distribution of words that are likely to co-occur with it in a context window of a given size. The $j$-th node on the output layer gives the probability of observing word $w_j$ in word $w_I$'s context window.
You seem to have some strange notation in your formula; $u$, for example, is referenced with both one and two subscripts.
I think this is a better way to see it:
Skip-gram models the probability of a word $w_o$ being observed within word $w_i$'s context window as:
$p(w_o | w_i) = y_o$
$y = Softmax(z)$
$z = W_i C^T$
where $W$ is the word vector matrix ($|V| \times d$) and $C$ is the context vector matrix ($|V| \times d$), given a vocabulary of size $|V|$. $z$ is a $|V|$-dimensional vector containing the dot product of the input word's vector $W_i$ with every context word vector. The $Softmax$ turns this into $y$, a probability distribution over the vocabulary (also a $|V|$-dimensional vector), which is indexed at position $o$ to get the probability of observing word $w_o$.
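Here's a minimal numpy sketch of that forward computation (toy sizes; $W$ and $C$ follow the notation above, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                     # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))      # word (input) vectors, one row per word
C = rng.normal(size=(V, d))      # context vectors, one row per word
i, o = 2, 5                      # input word index and context word index

z = W[i] @ C.T                   # dot product of W_i with every context vector, shape (|V|,)
y = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
print(y[o])                      # p(w_o | w_i)
```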
This makes it very clear that the goal is to align word and context vectors of words which tend to co-occur, and similarly to spread apart those of pairs of words which do not co-occur.
Hope that helps!