Solved – Differentiation of Cross Entropy

Tags: cross entropy, derivative, differential equations, machine learning, neural networks

I have been trying to write a program for training neural networks on my computer. For the network in question, I decided to use the cross-entropy error function:

$$E = -\sum_jt_j\ln o_j$$

where $t_j$ is the target output for neuron $j$, and $o_j$ is the output of that neuron, which attempts to predict $t_j$.

I want to know what $\frac{\partial E}{\partial o_j}$ is for some neuron $j$. My intuition (plus my limited knowledge of calculus) led me to believe that this value should be $-\frac{t_j}{o_j}$.

However, this does not seem to be correct. Cross entropy is often used in tandem with the softmax function, such that $$o_j = \frac{e^{z_j}}{\sum_ke^{z_k}}$$ where $z$ is the vector of inputs to all neurons in the softmax layer (see here).
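As a concrete reference point, here is a minimal numerical sketch of the softmax formula above (the function name and sample values are illustrative, not from the original post):

```python
import math

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # e^(z_j - m) / sum_k e^(z_k - m) equals e^(z_j) / sum_k e^(z_k).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

o = softmax([1.0, 2.0, 3.0])
print(o)        # three positive values, largest for the largest input
print(sum(o))   # approximately 1.0
```

The outputs are positive and sum to 1, which is what lets them be read as predicted probabilities against the targets $t_j$.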

From this file, I gather that: $$\frac{\partial o_j}{\partial z_j} = o_j(1 - o_j)$$
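This diagonal derivative can be sanity-checked with a central finite difference (a quick sketch; the vector `z` and index `j` below are arbitrary illustrative choices):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [0.5, -1.2, 2.0]
j, eps = 0, 1e-6

# Central finite difference of o_j with respect to z_j.
zp = list(z); zp[j] += eps
zm = list(z); zm[j] -= eps
numeric = (softmax(zp)[j] - softmax(zm)[j]) / (2 * eps)

o_j = softmax(z)[j]
analytic = o_j * (1 - o_j)
print(numeric, analytic)  # the two values agree to roughly 1e-9
```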

According to this question: $$\frac{\partial E}{\partial z_j} = t_j - o_j$$
But this conflicts with my earlier guess of $\frac{\partial E}{\partial o_j}$. Why?

$$\frac{\partial E_j}{\partial z_j}=\frac{\partial E_j}{\partial o_j}\frac{\partial o_j}{\partial z_j}$$ $$\Rightarrow\frac{\partial E_j}{\partial o_j}=\frac{\partial E_j}{\partial z_j}\div\frac{\partial o_j}{\partial z_j}= \frac{t_j-o_j}{o_j(1-o_j)},$$ in direct contradiction to my earlier solution of $-\frac{t_j}{o_j}$.
So which (if either) solution for $\frac{\partial E_j}{\partial o_j}$ is correct, and why?

Best Answer

Your $\frac{\partial E}{\partial o_j}$ is correct, but $\frac{\partial E}{\partial z_j}$ must account for every output that depends on $z_j$, since the softmax couples all of them:
$$\frac{\partial E}{\partial z_j}=\sum_i\frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial z_j}$$

For $i=j$, using the results given in the post, we have
$$\frac{\partial E}{\partial o_j}\frac{\partial o_j}{\partial z_j}=-\frac{t_j}{o_j}o_j(1-o_j)=t_jo_j-t_j$$

For $i\neq j$,
$$\frac{\partial o_i}{\partial z_j}=\frac{\partial}{\partial z_j}\frac{e^{z_i}}{\sum_ke^{z_k}}=-\frac{e^{z_i}}{(\sum_ke^{z_k})^2}e^{z_j}=-o_io_j$$
$$\frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial z_j}=-\frac{t_i}{o_i}(-o_io_j)=t_io_j$$

So the summation is
$$\frac{\partial E}{\partial z_j}=\sum_i\frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial z_j}=\sum_it_io_j-t_j=o_j\sum_it_i-t_j$$

Since $t$ is a one-hot vector, $\sum_it_i=1$, and therefore
$$\frac{\partial E}{\partial z_j}=o_j-t_j$$

Also see this question.
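The final result $\frac{\partial E}{\partial z_j}=o_j-t_j$ can be verified numerically by differencing the full loss through the softmax (a sketch with arbitrary illustrative values; `z` and the one-hot `t` are not from the original post):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(t, o):
    # E = -sum_j t_j * ln(o_j)
    return -sum(tj * math.log(oj) for tj, oj in zip(t, o))

z = [0.3, -0.8, 1.5]
t = [0.0, 1.0, 0.0]  # one-hot target
eps = 1e-6

o = softmax(z)
analytic = [oj - tj for oj, tj in zip(o, t)]  # o_j - t_j for each j

# Central finite difference of E with respect to each z_j; note each
# perturbation of z_j moves every o_i, so the full Jacobian is exercised.
numeric = []
for j in range(len(z)):
    zp = list(z); zp[j] += eps
    zm = list(z); zm[j] -= eps
    numeric.append((cross_entropy(t, softmax(zp))
                    - cross_entropy(t, softmax(zm))) / (2 * eps))

print(analytic)
print(numeric)  # matches the analytic gradient to roughly 1e-9
```

Note the gradient entries sum to zero, since $\sum_j(o_j-t_j)=1-1=0$; the naive per-neuron answer $-\frac{t_j}{o_j}\cdot o_j(1-o_j)$ misses the off-diagonal terms and does not have this property.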