Solved – Backpropagation with Softmax / Cross Entropy

backpropagation, cross entropy, derivative, softmax

I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer.

The cross entropy error function is

$$E(t,o)=-\sum_j t_j \log o_j$$

with $t_j$ and $o_j$ as the target and output at neuron $j$, respectively. The sum is over each neuron in the output layer. $o_j$ itself is the result of the softmax function:

$$o_j=\text{softmax}(z_j)=\frac{e^{z_j}}{\sum_k e^{z_k}}$$

Again, the sum is over each neuron in the output layer and $z_j$ is the input to neuron $j$:

$$z_j=\sum_i w_{ij}o_i+b$$

That is, the sum runs over all neurons in the previous layer, with their corresponding outputs $o_i$ and weights $w_{ij}$ towards neuron $j$, plus a bias $b$.
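
To make the setup concrete, here is a minimal NumPy sketch of the forward pass I have in mind (the values in `z` and the one-hot target `t` are just made up for illustration):

```python
import numpy as np

# Made-up pre-activations z_j for three output neurons and a one-hot target t.
z = np.array([1.0, 2.0, 0.5])
t = np.array([0.0, 1.0, 0.0])

# Softmax: o_j = exp(z_j) / sum_k exp(z_k)  (shift by max(z) for numerical stability)
o = np.exp(z - z.max())
o /= o.sum()

# Cross entropy: E = -sum_j t_j * log(o_j)
E = -np.sum(t * np.log(o))

print(o, E)
```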

Now, to update a weight $w_{ij}$ that connects a neuron $j$ in the output layer with a neuron $i$ in the previous layer, I need to calculate the partial derivative of the error function using the chain rule:

$$\frac{\partial E} {\partial w_{ij}}=\frac{\partial E} {\partial o_j} \frac{\partial o_j} {\partial z_{j}} \frac{\partial z_j} {\partial w_{ij}}$$

with $z_j$ as the input to neuron $j$.

The last term is quite simple. Since there's only one weight between $i$ and $j$, the derivative is:

$$\frac{\partial z_j} {\partial w_{ij}}=o_i$$

The first term is the derivative of the error function with respect to the output $o_j$:

$$\frac{\partial E} {\partial o_j} = \frac{-t_j}{o_j}$$

The middle term, the derivative of the softmax function with respect to its input $z_j$, is harder:

$$\frac{\partial o_j} {\partial z_{j}}=\frac{\partial} {\partial z_{j}} \frac{e^{z_j}}{\sum_k e^{z_k}}$$

Let's say we have three output neurons corresponding to the classes $a, b, c$; then $o_b = \text{softmax}(z_b)$ is:

$$o_b=\frac{e^{z_b}}{\sum e^{z}}=\frac{e^{z_b}}{e^{z_a}+e^{z_b}+e^{z_c}} $$

and its derivative, using the quotient rule, is:

$$\frac{\partial o_b} {\partial z_{b}}=\frac{e^{z_b}\cdot\sum e^z - (e^{z_b})^2}{(\sum e^{z})^2}=\frac{e^{z_b}}{\sum e^z}-\frac{(e^{z_b})^2}{(\sum e^z)^2}$$
$$=\text{softmax}(z_b)-\text{softmax}^2(z_b)=o_b-o_b^2=o_b(1-o_b)$$
Back to the middle term for backpropagation, this means:
$$\frac{\partial o_j} {\partial z_{j}}=o_j(1-o_j)$$
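
As a quick sanity check, I compared this diagonal term against a central finite difference (again just a rough sketch with arbitrary test values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
j, eps = 1, 1e-6

# Numerical d o_j / d z_j: perturb only z_j and look at component j of the output.
zp, zm = z.copy(), z.copy()
zp[j] += eps
zm[j] -= eps
numeric = (softmax(zp)[j] - softmax(zm)[j]) / (2 * eps)

o = softmax(z)
print(numeric, o[j] * (1 - o[j]))  # these two agree up to finite-difference error
```

Numerically, the diagonal term does seem to match $o_j(1-o_j)$.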

Putting it all together I get

$$\frac{\partial E} {\partial w_{ij}}= \frac{-t_j}{o_j}\cdot o_j(1-o_j)\cdot o_i=-t_j(1-o_j)\,o_i$$

which means that if the target for this class is $t_j=0$, then the gradient, and hence the weight update, for $w_{ij}$ is zero no matter what the output is. That does not sound right.

Investigating this, I found that people use two variants for the softmax derivative, one for the case $i=j$ and another for $i\ne j$, like here or here.

But I can't make any sense of this. Also, I'm not even sure whether this is the cause of my error, which is why I'm posting all of my calculations. I hope someone can show me where I am missing something or going wrong.

Best Answer

Note: I am not an expert on backprop, but now having read a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (including a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!

I noticed a couple of inconsistencies in your derivation. I do not do neural networks really, so the following may be incorrect. However, here is how I would go about the problem.

First, you need to take account of the summation in $E$, and you cannot assume each term depends on only one weight: because of the softmax denominator, every output $o_j$ depends on every component of $z$. So taking the gradient of $E$ with respect to component $k$ of $z$, we have $$E=-\sum_jt_j\log o_j\implies\frac{\partial E}{\partial z_k}=-\sum_jt_j\frac{\partial \log o_j}{\partial z_k}$$

Then, expressing $o_j$ as $$o_j=\tfrac{1}{\Omega}e^{z_j} \,,\, \Omega=\sum_ie^{z_i} \implies \log o_j=z_j-\log\Omega$$ we have $$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-\frac{1}{\Omega}\frac{\partial\Omega}{\partial z_k}$$ where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax denominator is $$\frac{\partial\Omega}{\partial z_k}=\sum_ie^{z_i}\delta_{ik}=e^{z_k}$$ which gives $$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-o_k$$ or, multiplying through by $o_j$ (since $\frac{\partial o_j}{\partial z_k}=o_j\frac{\partial \log o_j}{\partial z_k}$), $$\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$$ Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($=1$ only when $k=j$).
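
If it helps, here is a small numerical sketch (NumPy, arbitrary test values) that checks $\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$ against central finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0, 0.7])
o = softmax(z)

# Analytic Jacobian: J[j, k] = o_j * (delta_jk - o_k)
J = np.diag(o) - np.outer(o, o)

# Finite-difference Jacobian, column by column.
eps = 1e-6
J_num = np.zeros_like(J)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    J_num[:, k] = (softmax(zp) - softmax(zm)) / (2 * eps)

print(np.max(np.abs(J - J_num)))  # very small (finite-difference error only)
```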

So the gradient of $E$ with respect to $z$ is then $$\frac{\partial E}{\partial z_k}=\sum_jt_j(o_k-\delta_{jk})=o_k\left(\sum_jt_j\right)-t_k \implies \frac{\partial E}{\partial z_k}=o_k\tau-t_k$$ where $\tau=\sum_jt_j$ is constant (for a given $t$ vector).

This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is "one-hot" we have $\tau=1$ (as noted in your first link).
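
A similar sketch for the gradient with respect to $z$, using a deliberately non-one-hot $t$ (made-up values) so that the $\tau$ factor actually matters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.3, -1.2, 2.0, 0.7])
t = np.array([0.1, 0.2, 0.6, 0.3])   # not one-hot, so tau = sum(t) = 1.2

o, tau = softmax(z), t.sum()
grad = o * tau - t                   # analytic: dE/dz_k = o_k * tau - t_k

# Central finite differences for comparison.
eps = 1e-6
grad_num = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    grad_num[k] = (cross_entropy(zp, t) - cross_entropy(zm, t)) / (2 * eps)

print(np.max(np.abs(grad - grad_num)))  # very small
```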

A second inconsistency, if I understand correctly, is that the "$o$" that is input to $z$ seems unlikely to be the "$o$" that is output from the softmax. I would think that it makes more sense that this is actually "further back" in the network architecture?

Calling this vector $y$, we then have $$z_k=\sum_iw_{ik}y_i+b_k \implies \frac{\partial z_k}{\partial w_{pq}}=\sum_iy_i\frac{\partial w_{ik}}{\partial w_{pq}}=\sum_iy_i\delta_{ip}\delta_{kq}=\delta_{kq}y_p$$

Finally, to get the gradient of $E$ with respect to the weight-matrix $w$, we use the chain rule $$\frac{\partial E}{\partial w_{pq}}=\sum_k\frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}=\sum_k(o_k\tau-t_k)\delta_{kq}y_p=y_p(o_q\tau-t_q)$$ giving the final expression (assuming a one-hot $t$, i.e. $\tau=1$) $$\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$$ where $y$ is the input on the lowest level (of your example).

So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).
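
For completeness, here is a sketch of the whole layer with a one-hot $t$, checking $\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$ against finite differences on the weight matrix (the sizes and random values are arbitrary; `w[i, j]` is the weight from input $i$ to output $j$, matching the indexing above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=5)            # output of the previous layer
w = rng.normal(size=(5, 3))       # w[i, j]: weight from input i to output j
b = rng.normal(size=3)
t = np.array([0.0, 1.0, 0.0])     # one-hot target, so tau = 1

def loss(w):
    z = y @ w + b                 # z_k = sum_i w_ik y_i + b_k
    return -np.sum(t * np.log(softmax(z)))

# Analytic gradient: dE/dw_ij = y_i * (o_j - t_j), i.e. an outer product.
o = softmax(y @ w + b)
grad = np.outer(y, o - t)

# Finite-difference gradient for comparison.
eps = 1e-6
grad_num = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        wp, wm = w.copy(), w.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        grad_num[i, j] = (loss(wp) - loss(wm)) / (2 * eps)

print(np.max(np.abs(grad - grad_num)))  # very small
```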

Hopefully this helps. Does this result seem more consistent?

Update: In response to a query from the OP in the comments, here is an expansion of the first step. First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have $$\frac{\partial E}{\partial w_{pq}}=\sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$$ and $$\frac{\partial o_i}{\partial w_{pq}}=\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$$ so $$\frac{\partial E}{\partial w_{pq}}=\sum_i \left[ \frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right) \right]$$ In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
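
To make the fully expanded chain rule concrete, here is a sketch that builds the three factors with all their indices and contracts them with `einsum`; under the same assumptions as above (one-hot $t$, `w[i, j]` from input $i$ to output $j$), it reduces to the same outer-product expression:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
y = rng.normal(size=5)
w = rng.normal(size=(5, 3))
b = rng.normal(size=3)
t = np.array([0.0, 0.0, 1.0])   # one-hot, so tau = 1

z = y @ w + b
o = softmax(z)

# The three factors of the full chain rule, with every index spelled out:
dE_do = -t / o                                   # dE/do_i
do_dz = np.diag(o) - np.outer(o, o)              # do_i/dz_k = o_i (delta_ik - o_k)
dz_dw = np.einsum('kq,p->kpq', np.eye(3), y)     # dz_k/dw_pq = delta_kq * y_p

# dE/dw_pq = sum_i sum_k  dE/do_i * do_i/dz_k * dz_k/dw_pq
grad_full = np.einsum('i,ik,kpq->pq', dE_do, do_dz, dz_dw)

# The reduced formula from above (one-hot t):
grad_reduced = np.outer(y, o - t)

print(np.max(np.abs(grad_full - grad_reduced)))  # essentially zero
```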