Solved – How is the Cross-Entropy Cost Function back-propagated

backpropagation, cross entropy, machine learning, neural networks

I've looked at a few threads about this but they've not been exactly what I'm after. When back-propagating the quadratic cost function, you first find the output error from $\delta_L = \nabla_a C \odot \sigma'(z_L)$.

You then can backpropagate this error through the network, where $\delta_l = ((w_{l+1})^T\delta_{l+1})\odot\sigma'(z_l)$. The gradients, and therefore how the weights and biases change, are then based on these errors, so that $\frac{\partial C}{\partial b_l} = \delta_l$ and $\frac{\partial C}{\partial w_l} = a_{l-1}\delta_l$.

Source for this
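To make sure I have the quadratic-cost case right, here is a minimal NumPy sketch of how I understand that backward pass (the function names and the list-of-matrices layout are just my own illustration, not something from the linked source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop_quadratic(weights, biases, x, y):
    """One backward pass for the quadratic cost C = 0.5 * ||a_L - y||^2.

    weights[l] has shape (n_{l+1}, n_l); x and y are column vectors.
    Returns per-layer gradients for the biases and weights.
    """
    # Forward pass: store every z_l and activation a_l.
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    grad_b = [np.zeros_like(b) for b in biases]
    grad_w = [np.zeros_like(w) for w in weights]

    # Output error: delta_L = (a_L - y) * sigma'(z_L),
    # since nabla_a C = a_L - y for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta
    grad_w[-1] = delta @ activations[-2].T

    # Propagate backwards: delta_l = (w_{l+1}^T delta_{l+1}) * sigma'(z_l).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_w[-l] = delta @ activations[-l - 1].T
    return grad_b, grad_w
```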

However, the gradients for the cross-entropy cost function are not expressed in terms of $\delta_l$; they are given as $\frac{\partial C}{\partial b} = \sigma(z)-y$ and $\frac{\partial C}{\partial w_j} = x_j(\sigma(z)-y)$.

Source for this

It's obvious then that the gradient of the bias of the output neuron(s) is $\sigma(z)-y$ and that the gradient of the weights connecting the final hidden layer to the output layer is $x_j(\sigma(z)-y)$. However, what would the gradients be for the other layers? How do you actually back-propagate this?

Thanks for any help!

Best Answer

Firstly, note that $\delta_l$ is nothing other than $\frac{\partial C}{\partial z_l}$, which can be expanded via the chain rule to $\frac{\partial C}{\partial a_l}\frac{\partial a_l}{\partial z_l}$.

Moreover, you know that the derivative of each bias can be computed as $\frac{\partial C}{\partial b_l} = \delta_l$ and the derivative of each weight as $\frac{\partial C}{\partial w_l} = a_{l-1}\delta_l$. This also holds for the last layer: $\delta_L$ is $\sigma(z)-y$. For a derivation of $\delta_L = \frac{\partial C}{\partial z_L} = \sigma(z)-y$, see for example this answer.
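To spell that derivation out for a single sigmoid output neuron with the binary cross-entropy cost (a sketch of the standard argument, writing $a = \sigma(z)$ for brevity):

```latex
% Sketch: one sigmoid output neuron, binary cross-entropy cost,
% with a = \sigma(z) as shorthand:
%   C = -[ y \ln a + (1 - y) \ln(1 - a) ]
\begin{aligned}
\frac{\partial C}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a}
                               = \frac{a - y}{a(1 - a)},\\[4pt]
\frac{\partial a}{\partial z} &= \sigma'(z) = a(1 - a),\\[4pt]
\delta_L = \frac{\partial C}{\partial z_L}
  &= \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial z}
   = \frac{a - y}{a(1 - a)} \cdot a(1 - a)
   = a - y = \sigma(z) - y.
\end{aligned}
```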

You can then simply plug it in as $\delta_{l+1}$ in the recursion $\delta_l = ((w_{l+1})^T\delta_{l+1})\odot\sigma'(z_l)$ to compute any preceding $\delta_l$; from there, the backpropagation proceeds exactly as in the quadratic-cost case.
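In code terms, relative to the sketch in the question, only the output-error line changes; the rest of the loop is untouched (this fragment reuses the variables defined there):

```python
# For cross-entropy with sigmoid outputs the sigma'(z_L) factor cancels,
# so the output error is just:
delta = activations[-1] - y                # delta_L = sigma(z_L) - y
grad_b[-1] = delta
grad_w[-1] = delta @ activations[-2].T

# Every earlier layer is then reached with the unchanged recursion
# delta_l = (w_{l+1}^T delta_{l+1}) * sigma'(z_l):
for l in range(2, len(weights) + 1):
    delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    grad_b[-l] = delta
    grad_w[-l] = delta @ activations[-l - 1].T
```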