[Math] derivative of cost function for Neural Network classifier

neural networks

I am following Andrew Ng's Machine Learning course on Coursera.

The cost function without regularization used in the Neural network course is:

$J(\theta) = \frac{1}{m} \sum ^{m}_{i=1}\sum ^{K}_{k=1} \left[-y_{k}^{(i)}\log\left((h_{\theta}(x^{(i)}))_{k}\right) -(1-y_{k}^{(i)})\log\left(1-(h_{\theta}(x^{(i)}))_{k}\right)\right]$

where $m$ is the number of examples, $K$ is the number of classes, $J(\theta)$ is the cost function, $x^{(i)}$ is the $i$-th training example, $\theta$ denotes the weight matrices, and $h_{\theta}(x^{(i)})$ is the network's prediction for the $i$-th training example.
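For concreteness, here is a minimal NumPy sketch of this unregularized cost. It assumes `H` is an $m \times K$ matrix of network outputs $h_{\theta}(x^{(i)})_k$ and `Y` is the matching $m \times K$ one-hot label matrix; both names are illustrative, not from the course code.

```python
import numpy as np

def cross_entropy_cost(H, Y):
    """Unregularized cost J(theta): H and Y are m x K arrays of
    predictions and one-hot labels, respectively."""
    m = Y.shape[0]
    # Sum the per-class binary cross-entropy terms over all examples and classes.
    return (1.0 / m) * np.sum(-Y * np.log(H) - (1 - Y) * np.log(1 - H))
```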

I understand intuitively that the backpropagation error associated with the last layer, $h$, is $h - y$. Nevertheless, I want to be able to prove this formally.

For simplicity, I considered $m = K = 1$:

$J(\theta) = -y \log(h_{\theta}) - (1-y) \log(1-h_{\theta})$

and tried to prove this to myself on paper but wasn't able to.

Neural Network Definition:

This neural network has 3 layers (1 input, 1 hidden, 1 output).

It uses the sigmoid activation function,

$\sigma(z) = \frac{1}{1+e^{-z}}$.
The input is $x$.

Input layer: $a^{(1)} = x$. (add bias $a_{0}^{(1)}$).

Hidden layer: $z^{(2)} = \Theta^{(1)}a^{(1)}$ , $a^{(2)} = \sigma(z^{(2)})$ (add bias $a_{0}^{(2)}$).

Output layer: $z^{(3)} = \Theta^{(2)}a^{(2)}$ , $a^{(3)} = \sigma(z^{(3)}) = h_{\theta}(x)$.
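As a sanity check, here is a minimal NumPy sketch of this forward pass. It assumes `Theta1` and `Theta2` are the weight matrices $\Theta^{(1)}$ and $\Theta^{(2)}$ (each with a bias column) and `x` is a single input vector; the names are illustrative, not taken from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Forward pass for the 3-layer network described above."""
    a1 = np.concatenate(([1.0], x))            # input layer with bias a_0^(1)
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # hidden layer with bias a_0^(2)
    z3 = Theta2 @ a2
    a3 = sigmoid(z3)                           # output layer, h_theta(x)
    return a1, z2, a2, z3, a3
```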

During backpropagation, $\delta^{(3)}$ is the error associated with the output layer.

Question:

  1. Why is it that:

$\delta^{(3)} = h_{\theta} - y$ ?

  2. Shouldn't:

$\delta^{(3)} = \frac{\partial {J}} {\partial {h_{\theta}}}$ ?

Best Answer

First, since your cost function uses the binary cross-entropy error $\mathcal{H}$ with a sigmoid activation $\sigma$, you can see that:
\begin{align}
\frac{\partial J}{\partial h_\theta} &= \frac{1}{m}\sum_i\sum_k\frac{\partial }{\partial h_\theta}\mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{1}{m}\sum_i\sum_k \left[ \frac{-y_k^{(i)}}{h_\theta(x^{(i)})_k} + \frac{1-y_k^{(i)}}{1-h_\theta(x^{(i)})_k} \right] \\
&= \frac{1}{m}\sum_i\sum_k \frac{h_\theta(x^{(i)})_k - y_k^{(i)}}{ h_\theta(x^{(i)})_k\,(1-h_\theta(x^{(i)})_k) }
\end{align}
Hence, for $m=K=1$, as a commenter notes,
$$
\frac{\partial J}{\partial h_\theta} = \frac{h_\theta - y}{ h_\theta(1-h_\theta) }
$$
But this on its own is not so useful, as it only tells you how the error changes as the final output changes. What you really want is how the cost changes as the weights $\Theta^{(\ell)}_{ij}$ are varied, so you can do gradient descent on them. An intermediate step is to compute the variation with respect to the pre-activation $z^{(s)}$ of the last layer $s$, where $h_\theta = \sigma(z^{(s)})$. The output layer error is then:
\begin{align}
\delta^{(s)}_j &= \frac{\partial J}{\partial z_j^{(s)}}\\
&= \frac{1}{m}\sum_i\sum_k \frac{\partial }{\partial z_j^{(s)}} \mathcal{H}\left(y_k^{(i)},h_\theta(x^{(i)})_k\right) \\
&= \frac{-1}{m}\sum_i\sum_k \left[ y_k^{(i)} \frac{1}{h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} - (1-y_k^{(i)})\frac{1}{1-h_\theta(x^{(i)})_k}\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} \right] \\
&= \frac{-1}{m}\sum_i\sum_k \left[ [1-h_\theta(x^{(i)})_k]\,y_k^{(i)} - h_\theta(x^{(i)})_k\,[1-y_k^{(i)}] \right]\\
&= \frac{1}{m}\sum_i\sum_k \left[ h_\theta(x^{(i)})_k -y_k^{(i)} \right]
\end{align}
using the fact that, for the sigmoid output layer (where output $k$ depends only on $z_k^{(s)}$, so only the $k=j$ term contributes),
$$
\frac{\partial h_\theta(x^{(i)})_k}{\partial z_j^{(s)}} = \sigma'(z_j^{(s)}) = \sigma(z_j^{(s)})\,[1-\sigma(z_j^{(s)})] = h_\theta(x^{(i)})_k\,[1-h_\theta(x^{(i)})_k]
$$
So in the case that $m=K=1$ and $s=3$, we have:
$$
\delta^{(3)} = h_\theta - y
$$
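A quick numerical check of this result, for the $m=K=1$ case: perturb $z^{(3)}$ directly and compare a finite-difference estimate of $\partial J/\partial z^{(3)}$ against $h_\theta - y$. The helper names below are illustrative, not from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_from_z3(z3, y):
    """J(theta) for m = K = 1, written as a function of the output pre-activation z^(3)."""
    h = sigmoid(z3)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

z3, y, eps = 0.7, 1.0, 1e-6
numeric = (cost_from_z3(z3 + eps, y) - cost_from_z3(z3 - eps, y)) / (2 * eps)
analytic = sigmoid(z3) - y   # delta^(3) = h_theta - y
print(numeric, analytic)     # the two values should agree to high precision
```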
