How to derive the recursive backpropagation equation for $\delta_j = \frac{\partial E_n}{ \partial a_j}$ in neural networks

machine-learning, mathematical-statistics, neural-networks

I am following the derivation of backpropagation presented in Bishop's book Pattern Recognition and Machine Learning, and I had some trouble following the derivation in section 5.3.1.

In that section they apply the chain rule for partial derivatives to the definition of $\delta_j$ and get equation (5.55):

$$ \delta_j \equiv \frac{\partial E_n}{ \partial a_j} = \sum_k \frac{\partial E_n}{ \partial a_k} \frac{\partial a_k}{ \partial a_j} $$

where the sum runs over all units $k$ to which unit $j$ sends connections.

My question is how they get from equation (5.55) to equation (5.56):

$$ \delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$

In the book they explain how that equation comes about with the following paragraph:

If we now substitute the definition of $\delta_j$ given by equation (5.51), $\delta_j \equiv \frac{\partial E_n}{ \partial a_j}$, into equation (5.55), $ \delta_j \equiv \frac{\partial E_n}{ \partial a_j} = \sum_k \frac{\partial E_n}{ \partial a_k} \frac{\partial a_k}{ \partial a_j} $, and make use of (5.48), $a_j = \sum_i w_{ji} z_i$, and (5.49), $z_j = h(a_j)$, we obtain the backpropagation formula (5.56): $ \delta_j = h'(a_j) \sum_k w_{kj} \delta_k$.

Basically, it is not clear to me how those steps lead from $ \delta_j \equiv \frac{\partial E_n}{ \partial a_j} = \sum_k \frac{\partial E_n}{ \partial a_k} \frac{\partial a_k}{ \partial a_j} $ to $ \delta_j = h'(a_j) \sum_k w_{kj} \delta_k$.

I have tried applying those steps, and this is what I have so far:

First I substituted the definition of $\delta_k$ into the multivariable chain rule to get from $ \delta_j \equiv \frac{\partial E_n}{ \partial a_j} = \sum_k \frac{\partial E_n}{ \partial a_k} \frac{\partial a_k}{ \partial a_j} $ to:

$$ \delta_j = \sum_k \delta_k \frac{\partial a_k}{ \partial a_j} $$

Then I guessed that they somehow applied the chain rule again to $ \frac{\partial a_k}{ \partial a_j} $, involving $\frac{ \partial h(a_j) }{\partial a_j} = h'(a_j)$, and substituted the result back, but it is not clear to me exactly how that was done. Does anyone have an idea?


As a reference I will paste the relevant section of the book to help:

[Screenshots of the relevant pages from Bishop, section 5.3.1]

Best Answer

The first three steps are just the substitutions given in the explanation.

The fourth step deserves a little explanation. Equation (5.55) expands the chain rule over "all units $k$ to which unit $j$ sends connections." Then (5.48) expands each $a_k$ in terms of its feed-forward inputs $z_i = h(a_i)$, where $i$ runs over the units in the same layer as unit $j$. For example, in a three-layer network, let $a_j$ be one of the hidden units; then the $a_k$ are the output units to which unit $j$ sends connections, and each $a_k$ is computed from the hidden-layer activations $a_i$. Since $a_j$ and the $a_i$ belong to the same layer, $\frac{\partial h(a_i)}{\partial a_j}$ is zero except when $i=j$, so only one term of the inner sum survives.
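Written out explicitly, that step is

$$ \frac{\partial a_k}{ \partial a_j} = \frac{\partial}{ \partial a_j}\sum_i w_{ki}\, h(a_i) = \sum_i w_{ki}\, \frac{\partial h(a_i)}{ \partial a_j} = w_{kj}\, h'(a_j), $$

since $\frac{\partial h(a_i)}{ \partial a_j}$ equals $h'(a_j)$ when $i=j$ and zero otherwise.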

The last step just pulls $h'(a_j)$ out of the sum, since it does not depend on $k$.

$$ \begin{align} \delta_j \equiv \frac{\partial E_n}{ \partial a_j} &= \sum_k \frac{\partial E_n}{ \partial a_k} \frac{\partial a_k}{ \partial a_j} \quad (5.55)\\ &= \sum_k \delta_k \frac{\partial a_k}{ \partial a_j} \quad (5.51, \text{definition of } \delta_k)\\ &= \sum_k \delta_k \frac{\partial}{ \partial a_j}\Big(\sum_i w_{ki}z_i\Big) \quad (5.48)\\ &= \sum_k \delta_k \frac{\partial}{ \partial a_j}\Big(\sum_i w_{ki}h(a_i)\Big) \quad (5.49)\\ &= \sum_k \delta_k w_{kj}h'(a_j) \quad (\text{only the } i=j \text{ term survives})\\ &= h'(a_j)\sum_k \delta_k w_{kj} \end{align} $$
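As a sanity check (not from the book), here is a minimal NumPy sketch of a one-hidden-layer network with $h = \tanh$, linear output units, and sum-of-squares error, where the hidden-layer $\delta_j$ computed by the recursion $\delta_j = h'(a_j)\sum_k w_{kj}\delta_k$ is compared against a finite-difference estimate of $\partial E_n / \partial a_j$. The layer sizes, the tanh activation, and the random data are arbitrary choices for illustration.

```python
import numpy as np

# Minimal sketch: one hidden layer with h = tanh, linear outputs,
# sum-of-squares error E_n = 0.5 * sum_k (y_k - t_k)^2.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                      # input, hidden, output sizes (arbitrary)
W1 = rng.normal(size=(M, D))           # first-layer weights w_{ji}
W2 = rng.normal(size=(K, M))           # second-layer weights w_{kj}
x = rng.normal(size=D)                 # single input pattern
t = rng.normal(size=K)                 # target

def forward(a_hidden):
    """Forward pass from hidden activations a_j to (outputs, error E_n)."""
    z = np.tanh(a_hidden)              # z_j = h(a_j)            (5.49)
    a_out = W2 @ z                     # a_k = sum_j w_{kj} z_j  (5.48)
    return a_out, 0.5 * np.sum((a_out - t) ** 2)

a_hidden = W1 @ x                      # a_j = sum_i w_{ji} x_i
a_out, E = forward(a_hidden)

# Backward pass: delta_k = y_k - t_k for linear outputs with squared error,
# then the recursion delta_j = h'(a_j) * sum_k w_{kj} delta_k   (5.56).
delta_k = a_out - t
delta_j = (1.0 - np.tanh(a_hidden) ** 2) * (W2.T @ delta_k)

# Central finite-difference estimate of dE_n/da_j for comparison.
eps = 1e-6
fd = np.zeros(M)
for j in range(M):
    a_plus, a_minus = a_hidden.copy(), a_hidden.copy()
    a_plus[j] += eps
    a_minus[j] -= eps
    fd[j] = (forward(a_plus)[1] - forward(a_minus)[1]) / (2 * eps)

print(np.allclose(delta_j, fd, atol=1e-6))   # expected: True
```

The check confirms that the recursion reproduces the derivative $\partial E_n / \partial a_j$ computed directly by perturbing each hidden activation.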
