Adjusting the Weight Matrix in Gradient Descent Backpropagation Through Neural Networks

machine-learning, neural-networks

In many gradient descent algorithms for backpropagating an error through a neural network, the final update line looks something like this:

$$ W_{ij} = W_{ij} - \mu \frac{\partial E}{\partial W_{ij}} $$

i.e. adjust each weight by an amount proportional to the gradient of the error with respect to that weight, scaled by the learning rate $\mu$. I have looked quite hard for a good explanation of why this works but I cannot find one anywhere; even some of the papers that defined backpropagation in the 1980s, e.g. by Werbos, Pineda, and Hecht-Nielsen, only quote this result without explaining it.
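To give a concrete toy instance of the update I mean, here is a minimal numpy sketch; the network, data, and learning rate below are made up for illustration and are not taken from any of the papers above:

```python
import numpy as np

# Toy setup: one linear layer with squared-error objective E = 0.5 * ||y - t||^2.
# W, x, t and mu are arbitrary choices for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))      # weight matrix W_ij
x = np.array([0.5, -1.0, 2.0])   # input
t = np.array([1.0, 0.0])         # target
mu = 0.1                          # learning rate

for step in range(100):
    y = W @ x                            # network output
    E = 0.5 * np.sum((y - t) ** 2)       # error
    dE_dW = np.outer(y - t, x)           # dE/dW_ij = (y_i - t_i) * x_j
    W = W - mu * dE_dW                   # the update rule in question

print(E)  # E shrinks towards 0 as W follows the negative gradient
```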

In particular, in Pineda 1987, which I believe is the first place the backpropagation algorithm was defined in its general form, it is stated that the rate of change of the weights is proportional to the negative of the gradient of the error:

$$ \frac{dW_{ij}}{dt} = -\mu \frac{\partial E}{\partial W_{ij}} $$

And Pineda justifies this by stating that, to ensure that the output of the network converges towards the target values, one should "let the system evolve in the weight space along trajectories that are antiparallel to the gradient of E."
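To check my reading of that statement, applying the chain rule in time to the evolution equation above gives

$$ \frac{dE}{dt} = \sum_{i,j} \frac{\partial E}{\partial W_{ij}} \frac{dW_{ij}}{dt} = -\mu \sum_{i,j} \left( \frac{\partial E}{\partial W_{ij}} \right)^2 \leq 0, $$

so $E$ is non-increasing along such trajectories, though it is not obvious to me that this alone guarantees convergence to the target values.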

Perhaps there is a simple explanation that I am overlooking as to why this would ensure convergence?

Best Answer

It works because we combine three things:

  1. An objective we want to minimize (the squared error),
  2. A differentiable activation function (often a sigmoid), and
  3. The chain rule of differentiation.

A neural network is a composition of linear combinations of the activations in 2). Because differentiation is linear and composes through the chain rule, the gradient of the objective in 1) with respect to every weight can be computed layer by layer, which is exactly what backpropagation does.
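For concreteness, here is a minimal sketch of those three ingredients working together on a tiny one-hidden-layer network; the sizes, data, and learning rate are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny one-hidden-layer network; weights, data and mu are arbitrary.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))   # input -> hidden weights
W2 = rng.normal(size=(1, 3))   # hidden -> output weights
x = np.array([0.2, -0.7])
t = np.array([1.0])
mu = 0.5

for step in range(500):
    # Forward pass: linear combination -> sigmoid, twice.
    h = sigmoid(W1 @ x)                  # hidden activations
    y = sigmoid(W2 @ h)                  # network output
    E = 0.5 * np.sum((y - t) ** 2)       # 1) squared-error objective

    # Backward pass: chain rule through 2) the differentiable sigmoids.
    delta2 = (y - t) * y * (1 - y)           # dE/d(output pre-activation)
    dE_dW2 = np.outer(delta2, h)
    delta1 = (W2.T @ delta2) * h * (1 - h)   # chain rule back to hidden layer
    dE_dW1 = np.outer(delta1, x)

    # Gradient-descent update from the question.
    W2 -= mu * dE_dW2
    W1 -= mu * dE_dW1

print(E)  # the error approaches 0 as the weights move antiparallel to the gradient
```

The backward pass is nothing more than the chain rule applied layer by layer, and the last two lines of the loop are exactly the update rule quoted in the question.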
