Solved – Softmax with log-likelihood cost

likelihood, machine learning, neural networks, softmax

I am working on my understanding of neural networks using Michael Nielsen's "Neural Networks and Deep Learning."

Now in the third chapter, I am trying to develop an intuition of how softmax works together with a log-likelihood cost function.
http://neuralnetworksanddeeplearning.com/chap3.html

Nielsen defines the log-likelihood cost associated with a training input (eq. 80) as
$$C \equiv -\ln{a_y^L}$$
where $a_y^L$ is the activation for the desired output ($L$ being the index of the last layer). Nielsen claims that if we apply the softmax function to the last layer $$a_j^L= {e^{z_j^L} \over \sum_k e^{z_k^L}}$$
where $z_j^L$ is the weighted input for the $j$th neuron in the output layer, we get
$${\partial C \over \partial b_j^L} = a_j^L - y_j$$
and
$${\partial C \over \partial w_{jk}^L} = a_k^{L-1}(a_j^L - y_j)$$
where $b_j^L$ is the bias of the $j$th neuron in the output layer and $w_{jk}^L$ is the weight between the $k$th neuron in the last but one layer and the $j$th neuron in the last layer.
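
For what it's worth, the bias formula does seem to hold numerically. Here is a minimal numpy sketch to check it (the function names are mine, not Nielsen's; I use the fact that ${\partial z_j^L / \partial b_j^L} = 1$, so the gradient with respect to $z^L$ equals the bias gradient):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

def cost(z, y):
    return -np.log(softmax(z)[y])        # log-likelihood cost C = -ln a_y

rng = np.random.default_rng(0)
z = rng.normal(size=5)                   # arbitrary weighted inputs z^L
y = 2                                    # index of the desired output
y_onehot = np.eye(5)[y]

analytic = softmax(z) - y_onehot         # claimed: dC/db_j = a_j - y_j

eps = 1e-6
numeric = np.array([
    (cost(z + eps * e_j, y) - cost(z - eps * e_j, y)) / (2 * eps)
    for e_j in np.eye(5)                 # central differences w.r.t. each z_j
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```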

How does he arrive at this result? Aren't we supposed to measure the cost only for the desired output $y$? In the last two equations we seem to be doing this over all the output neurons. I am aware of backpropagation relations such as
$${\partial C \over \partial b_j^L}={\partial C \over \partial z_j^L}={\partial C \over \partial a_j^L}{\partial a_j^L \over \partial z_j^L}$$
however, I am still missing how we get the partial derivatives with respect to the weights and biases.

Best Answer

I have not read the book you mentioned, but it seems you may have missed the point of backpropagation. Yes, you usually only measure the cost at the output, but in order to train your model (i.e., to get the desirable weights and biases), you have to compute the gradients of the cost with respect to all the weights and biases. That is why you have those last two equations.

Now, for example, you get
$$ {\partial C \over \partial b_j^L}={\partial C \over \partial z_j^L}{\partial z_j^L \over \partial b_j^L}=\left(\sum_k{\partial C \over \partial a_k^L}{\partial a_k^L \over \partial z_j^L}\right){\partial z_j^L \over \partial b_j^L} $$
where you know all of the functions $C(a)$, $a(z)$, and $z(b)$. Note that the chain rule needs a sum over $k$ here: because of the normalization in the softmax, every activation $a_k^L$ depends on $z_j^L$, not just $a_j^L$. You then have
$$ {\partial C \over \partial a_k^L} = {\partial (-\ln{a_y^L}) \over \partial a_k^L} = \begin{cases} -{1 \over a_y^L}, & k=y \\ 0, & k \neq y \end{cases} $$
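
Because ${\partial C / \partial a_k^L}$ vanishes for every $k \neq y$, the sum over $k$ collapses to a single term (just spelling out the step):
$$ {\partial C \over \partial z_j^L} = \sum_k {\partial C \over \partial a_k^L}{\partial a_k^L \over \partial z_j^L} = -{1 \over a_y^L}\,{\partial a_y^L \over \partial z_j^L} $$
This is also why the cost, although it only involves the desired output's activation, still produces a nonzero gradient for every output neuron.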

$$ {\partial a_k^L \over \partial z_j^L} = \begin{cases} a_j^L(1-a_j^L), & j=k \\ -a_k^L a_j^L, & j \neq k \end{cases} $$
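
If it helps, here is a minimal numpy sketch (the function name is mine, not from the book) that checks these two cases against a numerically computed Jacobian; in matrix form the cases are just $J = \mathrm{diag}(a) - a a^T$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0, 0.1])     # arbitrary weighted inputs z^L
a = softmax(z)

# Case formula in matrix form: J[k, j] = a_j*(1 - a_j) if j == k else -a_k*a_j
analytic = np.diag(a) - np.outer(a, a)

eps = 1e-6
numeric = np.column_stack([
    (softmax(z + eps * e_j) - softmax(z - eps * e_j)) / (2 * eps)   # column j = da/dz_j
    for e_j in np.eye(len(z))
])

print(np.allclose(analytic, numeric, atol=1e-6))    # True
```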

$$ {\partial z_j^L \over \partial b_j^L} = 1 $$

Plugging the two cases of ${\partial a_y^L / \partial z_j^L}$ into that single remaining term (and using ${\partial z_j^L / \partial b_j^L} = 1$), you then have
$$ {\partial C \over \partial b_j^L} = -{1 \over a_y^L}{\partial a_y^L \over \partial z_j^L} = \begin{cases} -{1 \over a_y^L}\, a_y^L(1-a_y^L) = a_j^L-1, & j=y \\ -{1 \over a_y^L}\,(-a_y^L a_j^L) = a_j^L, & j \neq y \end{cases} $$
i.e.,
$$ {\partial C \over \partial b_j^L} = a_j^L - 1_{j=y} = a_j^L - y_j $$
since $y_j$ is $1$ for the desired output and $0$ for every other output neuron.
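
The weight gradient that you asked about follows in exactly the same way: since $z_j^L = \sum_k w_{jk}^L a_k^{L-1} + b_j^L$, we have ${\partial z_j^L / \partial w_{jk}^L} = a_k^{L-1}$, and therefore
$$ {\partial C \over \partial w_{jk}^L} = {\partial C \over \partial z_j^L}{\partial z_j^L \over \partial w_{jk}^L} = a_k^{L-1}\,(a_j^L - y_j) $$
which is Nielsen's second formula.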