Solved – Back-propagation in Neural Nets with >2 hidden layers

gradient descent, neural networks

We have the following update formulas:

Output layer (indexed by $k \in \{1, \dots, \text{number of classes} \}$):

$o_k$ – value of output neuron $k$

$d_k$ – desired value of output neuron $k$

$\eta$ – learning rate

$x_j$ – value of neuron $j$ in the last hidden layer

$$\delta_k = o_k(1 - o_k)(d_k - o_k)$$
$$\omega_{jk} = \omega_{jk} + \Delta\omega_{jk}, \text{ where } \Delta\omega_{jk} = \eta\delta_k x_j$$

Last hidden layer (HL) (indexed by $j \in \{1, \dots, \text{number of hidden neurons in the last HL} \}$):
$$\delta_j = x_j(1 - x_j)\sum_k \omega_{jk}\delta_k$$
$$\omega_{ij} = \omega_{ij} + \Delta\omega_{ij}, \text{ where } \Delta\omega_{ij} = \eta\delta_j x_i$$
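For concreteness, the two updates above can be written as a short NumPy sketch. The array names (`o`, `d`, `x_last`, `x_prev`, `W_out`, `W_hidden`) and their values are made-up assumptions for illustration, not part of the question:

```python
import numpy as np

eta = 0.1                           # learning rate
o = np.array([0.7, 0.2])            # output activations o_k
d = np.array([1.0, 0.0])            # desired outputs d_k
x_last = np.array([0.5, 0.9, 0.3])  # activations x_j of the last hidden layer
x_prev = np.array([0.1, 0.4])       # activations x_i feeding the last hidden layer
W_out = np.full((3, 2), 0.2)        # W_out[j, k]: weight from hidden neuron j to output k
W_hidden = np.full((2, 3), 0.2)     # W_hidden[i, j]: weight from neuron i to hidden neuron j

# Output layer: delta_k = o_k (1 - o_k)(d_k - o_k)
delta_out = o * (1.0 - o) * (d - o)

# Last hidden layer: delta_j = x_j (1 - x_j) sum_k w_jk delta_k
# (computed with the current weights, before any update is applied)
delta_hidden = x_last * (1.0 - x_last) * (W_out @ delta_out)

# Weight updates: Delta w_jk = eta * delta_k * x_j and Delta w_ij = eta * delta_j * x_i
W_out += eta * np.outer(x_last, delta_out)
W_hidden += eta * np.outer(x_prev, delta_hidden)
```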

Do these update formulas hold in general for any number of hidden layers? I.e.:

$l$-th hidden layer ($l \in \{1, \dots, L-1\}$, where layer $0$ is the input layer and layer $L$ is the output layer), indexed by $p \in \{1, \dots, \text{number of neurons in layer } l\}$, with $j \in \{1, \dots, \text{number of neurons in layer } l+1\}$ indexing the neurons of the next layer:

$$\delta_p = x_p(1 - x_p)\sum_j \omega_{pj}\delta_j$$
$$\omega_{pj} = \omega_{pj} + \Delta\omega_{pj}, \text{ where } \Delta\omega_{pj} = \eta\delta_j x_p$$

If this is true, it is possible to recursively compute the $\delta$s of all neurons (except the input ones), starting at the output layer and working backwards through the hidden layers, and then simply update each weight from its $\delta$ and $x$. Correct?
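To make the described recursion concrete, here is a minimal, self-contained sketch for an arbitrary number of sigmoid layers; the layer sizes, random initialization, data, and learning rate are arbitrary assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]                 # input layer, two hidden layers, output layer
# weights[l][p, j] connects neuron p of layer l to neuron j of layer l+1
weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def backprop_step(x, d, eta=0.1):
    # forward pass: keep the activations of every layer
    activations = [x]
    for W in weights:
        activations.append(sigmoid(activations[-1] @ W))

    # output layer: delta_k = o_k (1 - o_k)(d_k - o_k)
    o = activations[-1]
    delta = o * (1.0 - o) * (d - o)

    # walk backwards: accumulate the weight change for layer l, propagate delta to
    # layer l-1 via delta_p = x_p (1 - x_p) sum_j w_pj delta_j, then apply the update
    for l in range(len(weights) - 1, -1, -1):
        grad = np.outer(activations[l], delta)       # Delta w_pj = eta * delta_j * x_p
        if l > 0:
            x_p = activations[l]
            delta = x_p * (1.0 - x_p) * (weights[l] @ delta)
        weights[l] += eta * grad

backprop_step(rng.random(sizes[0]), np.array([1.0, 0.0, 0.0]))
```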

Best Answer

This is just a straightforward computation of the partial derivatives, together with the observation that the derivatives for layer $i$ (counting from the top) can be computed entirely from the quantities already obtained for layer $i-1$. The recursion applies to any number of layers, but it also leads to the so-called "vanishing gradient" phenomenon, which is one reason multiple hidden layers were generally avoided (at least with a basic architecture and basic training). To overcome this issue, deep learning methods have been proposed in recent years (for example Deep Convolutional Networks, Deep Belief Networks, Deep Autoencoders, Deep Boltzmann Machines, etc.).
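As a rough, made-up illustration of why the gradient vanishes with sigmoid units (not part of the original answer): the factor $x(1 - x)$ in the recursion is at most $0.25$, so the back-propagated $\delta$s tend to shrink with every additional layer. The layer width, weight scale, and random values below are assumptions chosen only to show the effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n, depth = 20, 10
delta = rng.normal(size=n)                    # pretend these are output-layer deltas

for step in range(1, depth + 1):
    x = rng.random(n)                         # assumed sigmoid activations of a hidden layer
    W = rng.normal(scale=0.5, size=(n, n))    # assumed weights between the two layers
    delta = x * (1.0 - x) * (W @ delta)       # the hidden-layer recursion from above
    print(f"{step} layers below the output: mean |delta| = {np.abs(delta).mean():.2e}")
```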