Solved – How to calculate the derivative of crossentropy error function

derivativeloss-functionsmachine learningneural networksoptimization

I'm reading this tutorial (presented below) on computing derivative of crossentropy. The author used the loss function of logistic regression I think.
https://www.dropbox.com/s/rxrtz3auu845fuy/Softmax.pdf?dl=0

Most of the equations make sense to me except one thing. In the second page, there is:
$$\frac{\partial E_x}{\partial o^x_j}=\frac{t_j^x}{o_j^x}+\frac{1-t_j^x}{1-o^x_j}$$
However in the third page, the "Crossentropy derivative" becomes

$$\frac{\partial E_x}{\partial o^x_j}=-\frac{t_j^x}{o_j^x}+\frac{1-t_j^x}{1-o^x_j}$$

There is a minus sign in $E_x$. Then the derivative should be $\frac{\partial E_x}{\partial o^x_j}=-\frac{t_j^x}{o_j^x}-\frac{1-t_j^x}{1-o^x_j}$. But it is not. What have I missed?

The tutorial:

Best Answer

There is indeed a mistake:\begin{align} \frac{\partial E_x}{\partial o_j^x} &=\frac{\partial }{\partial o_j^x} \left( - \sum_{k}[t_k^x \log(o_k^x)] + (1-t_k^x) \log(1-o_k^x)]\right) \\ &=-\frac{\partial }{\partial o_j^x} \left( \sum_{k}[t_k^x \log(o_k^x)] + (1-t_k^x) \log(1-o_k^x)]\right) \\ &=-\frac{\partial }{\partial o_j^x} \left( [t_j^x \log(o_j^x)] + (1-t_j^x) \log(1-o_j^x)]\right) \\ &=- \left( \frac{t_j^x}{o_j^x} - \frac{1-t_j^x}{1-o_j^x}\right), \text{Chain rule} \\ &=- \frac{t_j^x}{o_j^x} + \frac{1-t_j^x}{1-o_j^x} \\ \end{align}

Notation:

I'll follow the notation in this made-up example of color classification:

whereby $j$ is the index denoting any of the $K$ output neurons - not necessarily the one corresponding to the true, ($t)$, value. Now,

$$\begin{align} o_j&=\sigma(j)=\sigma(z_j)=\text{softmax}(j)=\text{softmax (neuron }j)=\frac{e^{z_j}}{\displaystyle\sum_K e^{z_k}}\\[3ex] z_j &= \mathbf w_j^\top \mathbf x = \text{preactivation (neuron }j) \end{align}$$

The loss function is the negative log likelihood:

$$E = -\log \sigma(t) = -\log \left(\text{softmax}(t)\right)$$

The negative log likelihood is also known as the multiclass cross-entropy (ref: Pattern Recognition and Machine Learning Section 4.3.4), as they are in fact two different interpretations of the same formula.

Gradient of the loss function with respect to the pre-activation of an output neuron:

$$\begin{align} \frac{\partial E}{\partial z_j}&=\frac{\partial}{\partial z_j}\,-\log\left( \sigma(t)\right)\\[2ex] &= \frac{-1}{\sigma(t)}\quad\frac{\partial}{\partial z_j}\sigma(t)\\[2ex] &= \frac{-1}{\sigma(t)}\quad\frac{\partial}{\partial z_j}\sigma(z_j)\\[2ex] &= \frac{-1}{\sigma(t)}\quad\frac{\partial}{\partial z_j}\frac{e^{z_t}}{\displaystyle\sum_k e^{z_k}}\\[2ex] &= \frac{-1}{\sigma(t)}\quad\left[ \frac{\frac{\partial }{\partial z_j }e^{z_t}}{\displaystyle \sum_K e^{z_k}} \quad - \quad \frac{e^{z_t}\quad \frac{\partial}{\partial z_j}\displaystyle \sum_K e^{z_k}}{\left[\displaystyle\sum_K e^{z_k}\right]^2}\right]\\[2ex] &= \frac{-1}{\sigma(t)}\quad\left[ \frac{\delta_{jt}\;e^{z_t}}{\displaystyle \sum_K e^{z_k}} \quad - \quad \frac{e^{z_t}}{\displaystyle\sum_K e^{z_k}} \frac{e^{z_j}}{\displaystyle\sum_K e^{z_k}}\right]\\[2ex] &= \frac{-1}{\sigma(t)}\quad\left(\delta_{jt}\sigma(t) - \sigma(t)\sigma(j) \right)\\[2ex] &= - (\delta_{jt} - \sigma(j))\\[2ex] &= \sigma(j) - \delta_{jt} \end{align}$$

This is practically identical to $\frac{\partial E} {\partial z_j} = o_j - t_j$, and it does become identical if instead of focusing on $j$ as an individual output neuron, we transition to vectorial notation (as indicated in your question), and $t_j$ becomes the one-hot encoded vector of true values, which in my notation would be $\small \begin{bmatrix}0&0&0&\cdots&1&0&0&0_K\end{bmatrix}^\top$.

Then, with $\frac{\partial E} {\partial z_j} = o_j - t_j$ we are really calculating the gradient of the loss function with respect to the preactivation of all output neurons: the vector $t_j$ will contain a $1$ only in the neuron corresponding to the correct category, which is equivalent to the delta function $\delta_{jt}$, which is $1$ only when differentiating with respect to the pre-activation of the output neuron of the correct category.

In the Geoffrey Hinton's Coursera ML course the following chunk of code illustrates the implementation in Octave:

%% Compute derivative of cross-entropy loss function.
error_deriv = output_layer_state - expanded_target_batch;

The expanded_target_batch corresponds to the one-hot encoded sparse matrix with corresponding to the target of the training set. Hence, in the majority of the output neurons, the error_deriv = output_layer_state $(\sigma(j))$, because $\delta_{jt}$ is $0$, except for the neuron corresponding to the correct classification, in which case, a $1$ is going to be subtracted from $\sigma(j).$

The actual measurement of the cost is carried out with...

% MEASURE LOSS FUNCTION.
CE = -sum(sum(...
  expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;

We see again the $\frac{\partial E}{\partial z_j}$ in the beginning of the backpropagation algorithm:

$$\small\frac{\partial E}{\partial W_{hidd-2-out}}=\frac{\partial \text{outer}_{input}}{\partial W_{hidd-2-out}}\, \frac{\partial E}{\partial \text{outer}_{input}}=\frac{\partial z_j}{\partial W_{hidd-2-out}}\, \frac{\partial E}{\partial z_j}$$

hid_to_output_weights_gradient =  hidden_layer_state * error_deriv';
output_bias_gradient = sum(error_deriv, 2);

since $z_j = \text{outer}_{in}= W_{hidd-2-out} \times \text{hidden}_{out}$

Observation re: OP additional questions:

The splitting of partials in the OP, $\frac{\partial E} {\partial z_j} = {\frac{\partial E} {\partial o_j}}{\frac{\partial o_j} {\partial z_j}}$, seems unwarranted.

The updating of the weights from hidden to output proceeds as...

hid_to_output_weights_delta = ...
 momentum .* hid_to_output_weights_delta + ...
 hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights...
 - learning_rate * hid_to_output_weights_delta;

which don't include the output $o_j$ in the OP formula: $w_{ij} = w'_{ij} - r{\frac{\partial E} {\partial z_j}} {o_i}.$ The formula would be more along the lines of...

$$W_{hidd-2-out}:=W_{hidd-2-out}-r\, \small \frac{\partial E}{\partial W_{hidd-2-out}}\, \Delta_{hidd-2-out}$$

Loss Functions – How to Calculate Derivative of Cross Entropy Loss Function?

$$\nabla L = \begin{pmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_n} \end{pmatrix}$$

This requires computing the derivatives of the terms like

$$\log {1 \over {1+e^{-\vec x \cdot \vec w}}} = \log {1 \over {1+e^{-(x_1 \cdot w_1 + x_2 \cdot w_2 + \, \dots \, + x_n \cdot w_n)}}}$$

where you can use

$$\frac{\partial}{\partial x} \left( \log\frac{1}{1+e^{-(a+bx)}} \right) = \frac{b}{1+e^{(a+bx)}}$$

and

$$\frac{\partial}{\partial x} \left( \log(1-\frac{1}{1+e^{-(a+bx)}} ) \right) = \frac{b}{1-e^{-(a+bx)}}$$

Filling that in you get

$$\frac{\partial}{\partial w_j} L = \frac{\bar{y_i} x_j}{{1+e^{\vec x \cdot \vec w}}} - \frac{(1-\bar{y_i}) x_j }{{1-e^{-\vec x \cdot \vec w}}}$$

and

$$\nabla L = \left( \frac{\bar{y_i}}{{1+e^{\vec x \cdot \vec w}}} - \frac{(1-\bar{y_i}) }{{1-e^{-\vec x \cdot \vec w}}} \right) \vec{x} $$

There is a minus sign in $E_x$. Then the derivative should be $\frac{\partial E_x}{\partial o^x_j}=-\frac{t_j^x}{o_j^x}-\frac{1-t_j^x}{1-o^x_j}$. But it is not. What have I missed?

Best Answer

Related Solutions

Solved – Neural network softmax activation

Notation:

Gradient of the loss function with respect to the pre-activation of an output neuron:

Loss Functions – How to Calculate Derivative of Cross Entropy Loss Function?

Related Question