I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer.
The cross entropy error function is
$$E(t,o)=-\sum_j t_j \log o_j$$
with $t_j$ and $o_j$ as the target and output at neuron $j$, respectively. The sum runs over all neurons in the output layer. $o_j$ itself is the result of the softmax function:
$$o_j=\operatorname{softmax}(z_j)=\frac{e^{z_j}}{\sum_k e^{z_k}}$$
Again, the sum runs over all neurons in the output layer, and $z_j$ is the input to neuron $j$:
$$z_j=\sum_i w_{ij}o_i+b$$
That is, the sum runs over all neurons in the previous layer, with their outputs $o_i$ and the weights $w_{ij}$ connecting them to neuron $j$, plus a bias $b$.
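For concreteness, this is how I compute the forward pass in code (a minimal NumPy sketch; the function and variable names are my own choices):

```python
import numpy as np

def softmax(z):
    # subtract max(z) for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(t, o):
    # E(t, o) = -sum_j t_j * log(o_j)
    return -np.sum(t * np.log(o))

rng = np.random.default_rng(0)
o_prev = rng.random(4)         # outputs o_i of the previous layer
w = rng.random((4, 3))         # w[i, j]: weight from neuron i to output neuron j
b = rng.random(3)              # bias term

z = o_prev @ w + b             # z_j = sum_i w_ij * o_i + b
o = softmax(z)                 # outputs o_j of the softmax layer
t = np.array([0.0, 1.0, 0.0])  # one-hot target
print(cross_entropy(t, o))
```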
Now, to update a weight $w_{ij}$ that connects a neuron $j$ in the output layer with a neuron $i$ in the previous layer, I need to calculate the partial derivative of the error function using the chain rule:
$$\frac{\partial E} {\partial w_{ij}}=\frac{\partial E} {\partial o_j} \frac{\partial o_j} {\partial z_{j}} \frac{\partial z_j} {\partial w_{ij}}$$
with $z_j$ as the input to neuron $j$.
The last term is quite simple. Since there's only one weight between $i$ and $j$, the derivative is:
$$\frac{\partial z_j} {\partial w_{ij}}=o_i$$
The first term is the derivative of the error function with respect to the output $o_j$:
$$\frac{\partial E} {\partial o_j} = \frac{-t_j}{o_j}$$
The middle term, the derivative of the softmax function with respect to its input $z_j$, is harder:
$$\frac{\partial o_j} {\partial z_{j}}=\frac{\partial} {\partial z_{j}} \frac{e^{z_j}}{\sum_k e^{z_k}}$$
Let's say we have three output neurons corresponding to the classes $a,b,c$; then $o_b = \operatorname{softmax}(z_b)$ is:
$$o_b=\frac{e^{z_b}}{\sum e^{z}}=\frac{e^{z_b}}{e^{z_a}+e^{z_b}+e^{z_c}} $$
and its derivative, using the quotient rule, is:
$$\frac{\partial o_b} {\partial z_{b}}=\frac{e^{z_b}\sum e^z - (e^{z_b})^2}{(\sum e^{z})^2}=\frac{e^{z_b}}{\sum e^z}-\frac{(e^{z_b})^2}{(\sum e^z)^2}$$
$$=\operatorname{softmax}(z_b)-\operatorname{softmax}^2(z_b)=o_b-o_b^2=o_b(1-o_b)$$
Back to the middle term for backpropagation this means:
$$\frac{\partial o_j} {\partial z_{j}}=o_j(1-o_j)$$
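This diagonal case can be sanity-checked numerically (a small sketch, with `softmax` defined as in the snippet above), and it does hold:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.3, -1.2, 0.8])
o = softmax(z)

# numeric derivative of o_j with respect to z_j for j = 1
j, eps = 1, 1e-6
z_eps = z.copy()
z_eps[j] += eps
numeric = (softmax(z_eps)[j] - o[j]) / eps

print(numeric, o[j] * (1 - o[j]))  # the two agree to ~1e-6
```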
Putting it all together, I get
$$\frac{\partial E} {\partial w_{ij}}= \frac{-t_j}{o_j}\cdot o_j(1-o_j)\cdot o_i=-t_j(1-o_j)\,o_i$$
which means that if the target for this class is $t_j=0$, the corresponding weights are never updated at all. That does not seem right.
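To confirm that something is off, I compared my expression against a finite-difference gradient (a sketch building on the snippet above); the two disagree whenever $t_j=0$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
o_prev = rng.random(4)
w = rng.random((4, 3))
b = rng.random(3)
t = np.array([0.0, 1.0, 0.0])

def E(w):
    return -np.sum(t * np.log(softmax(o_prev @ w + b)))

o = softmax(o_prev @ w + b)
i, j = 0, 0                      # a class with t_j = 0
eps = 1e-6
w_eps = w.copy()
w_eps[i, j] += eps
numeric = (E(w_eps) - E(w)) / eps
claimed = -t[j] * (1 - o[j]) * o_prev[i]
print(numeric, claimed)          # numeric is nonzero, claimed is exactly 0
```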
Investigating this, I found that people use two variants of the softmax derivative, one for the case $i=j$ and another for $i\ne j$, like here or here.
But I can't make any sense of this. I'm also not sure whether this is even the cause of my error, which is why I'm posting all of my calculations. I hope someone can show me where I'm missing something or going wrong.
Best Answer
I noticed a couple of inconsistencies in your derivation. I do not really work with neural networks, so the following may be incorrect; however, here is how I would go about the problem.
First, you need to take account of the summation in $E$, and you cannot assume each term only depends on one weight. So taking the gradient of $E$ with respect to component $k$ of $z$, we have $$E=-\sum_jt_j\log o_j\implies\frac{\partial E}{\partial z_k}=-\sum_jt_j\frac{\partial \log o_j}{\partial z_k}$$
Then, expressing $o_j$ as $$o_j=\tfrac{1}{\Omega}e^{z_j} \,,\, \Omega=\sum_ie^{z_i} \implies \log o_j=z_j-\log\Omega$$ we have $$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-\frac{1}{\Omega}\frac{\partial\Omega}{\partial z_k}$$ where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax denominator is $$\frac{\partial\Omega}{\partial z_k}=\sum_ie^{z_i}\delta_{ik}=e^{z_k}$$ which gives $$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-o_k$$ or, multiplying through by $o_j$ (since $\frac{\partial o_j}{\partial z_k}=o_j\frac{\partial \log o_j}{\partial z_k}$), $$\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$$ Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($=1$ only when $k=j$).
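This Jacobian is easy to verify numerically (a NumPy sketch; the variable names are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.3, -1.2, 0.8])
o = softmax(z)

# analytic Jacobian: J[j, k] = o_j * (delta_jk - o_k)
J = np.diag(o) - np.outer(o, o)

# finite-difference Jacobian, column by column
eps = 1e-6
J_num = np.empty((3, 3))
for k in range(3):
    z_eps = z.copy()
    z_eps[k] += eps
    J_num[:, k] = (softmax(z_eps) - o) / eps

print(np.abs(J - J_num).max())  # ~1e-7
```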
So the gradient of $E$ with respect to $z$ is then $$\frac{\partial E}{\partial z_k}=\sum_jt_j(o_k-\delta_{jk})=o_k\left(\sum_jt_j\right)-t_k \implies \frac{\partial E}{\partial z_k}=o_k\tau-t_k$$ where $\tau=\sum_jt_j$ is constant (for a given $t$ vector).
This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is "one-hot" we have $\tau=1$ (as noted in your first link).
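This can also be checked directly (a sketch under the same setup; for a one-hot $t$ the expression reduces to $o_k-t_k$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def E(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.3, -1.2, 0.8])
t = np.array([0.0, 1.0, 0.0])   # one-hot, so tau = 1
o = softmax(z)

eps = 1e-6
grad_num = np.array([
    (E(z + eps * np.eye(3)[k], t) - E(z, t)) / eps for k in range(3)
])
print(grad_num)                 # matches the line below to ~1e-6
print(o * t.sum() - t)          # o_k * tau - t_k, here just o - t
```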
A second inconsistency, if I understand correctly, is that the "$o$" that is input to $z$ is unlikely to be the same "$o$" that is output from the softmax. I would think it makes more sense for this input to come from further back in the network architecture.
Calling this vector $y$, we then have $$z_k=\sum_iw_{ik}y_i+b_k \implies \frac{\partial z_k}{\partial w_{pq}}=\sum_iy_i\frac{\partial w_{ik}}{\partial w_{pq}}=\sum_iy_i\delta_{ip}\delta_{kq}=\delta_{kq}y_p$$
Finally, to get the gradient of $E$ with respect to the weight-matrix $w$, we use the chain rule $$\frac{\partial E}{\partial w_{pq}}=\sum_k\frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}=\sum_k(o_k\tau-t_k)\delta_{kq}y_p=y_p(o_q\tau-t_q)$$ giving the final expression (assuming a one-hot $t$, i.e. $\tau=1$) $$\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$$ where $y$ is the input on the lowest level (of your example).
So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).
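As a final sanity check, here is a finite-difference comparison against $y_i(o_j-t_j)$ (again a sketch with my own names):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def E(w, y, b, t):
    return -np.sum(t * np.log(softmax(y @ w + b)))

rng = np.random.default_rng(0)
y = rng.random(4)                 # inputs from the level below z
w = rng.random((4, 3))
b = rng.random(3)
t = np.array([0.0, 1.0, 0.0])     # one-hot, so tau = 1

o = softmax(y @ w + b)
analytic = np.outer(y, o - t)     # dE/dw_ij = y_i (o_j - t_j)

eps = 1e-6
numeric = np.empty_like(w)
for i in range(4):
    for j in range(3):
        w_eps = w.copy()
        w_eps[i, j] += eps
        numeric[i, j] = (E(w_eps, y, b, t) - E(w, y, b, t)) / eps

print(np.abs(analytic - numeric).max())  # ~1e-7
```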
Hopefully this helps. Does this result seem more consistent?