Solved – Intuition behind Backpropagation gradients

backpropagation, gradient, machine learning, neural networks

I'm currently taking Andrew Ng's Machine Learning Coursera course, and I'm not sure that I fully understand the Delta gradients in Backpropagation (BP).

I see that for the first layer of BP (which is really the output layer of the network) we can calculate the gradient by subtracting the true label vector from the output vector generated by the network. However, these delta terms are repeatedly referred to as gradients and partial derivatives. Specifically, he says:

[Image: lecture slide defining the δ ("delta") terms for each layer of the network]

I'm not sure why we are computing derivatives here, and moreover why computing consecutive derivatives for each layer results in overall better weights (especially if we end up just adding them all into some "accumulator"):

[Image: lecture slide showing the Δ accumulator update that sums the gradient contributions over the training examples]

Any insight would be awesome.

A link to a one-page PDF summary of the lesson is here, if it helps: https://www.coursera.org/learn/machine-learning/supplement/pjdBA/backpropagation-algorithm

Best Answer

Backprop is used to compute the gradient of the loss function--that is, a vector containing the partial derivative of the loss function with respect to each parameter of the network. The mechanics of backprop are just the chain rule from calculus, applied layer by layer from the output back toward the input: each layer's delta is the derivative of the loss with respect to that layer's pre-activation inputs, and it is computed from the delta of the layer after it.
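To make that concrete, here is a minimal NumPy sketch of the layer-by-layer chain rule for a network with one hidden layer and sigmoid activations. This is not the course's Octave code; bias units are omitted, and the names (`Theta1`, `Theta2`, `backprop_single_example`) are just illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_example(x, y, Theta1, Theta2):
    """Partial derivatives of the loss w.r.t. Theta1 and Theta2 for one
    training example (bias units omitted to keep the sketch short)."""
    # Forward pass
    a1 = x                                          # input activations
    z2 = Theta1 @ a1
    a2 = sigmoid(z2)                                # hidden-layer activations
    z3 = Theta2 @ a2
    a3 = sigmoid(z3)                                # network output

    # Backward pass: apply the chain rule one layer at a time
    delta3 = a3 - y                                 # output-layer delta (sigmoid output + cross-entropy loss)
    delta2 = (Theta2.T @ delta3) * a2 * (1 - a2)    # hidden-layer delta

    # The outer product of each delta with the previous layer's activations
    # gives the partial derivatives w.r.t. that layer's weights.
    grad_Theta2 = np.outer(delta3, a2)
    grad_Theta1 = np.outer(delta2, a1)
    return grad_Theta1, grad_Theta2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Theta1 = rng.normal(size=(4, 3))                # hidden layer: 4 units, 3 inputs
    Theta2 = rng.normal(size=(2, 4))                # output layer: 2 units
    x = np.array([0.5, -1.0, 2.0])
    y = np.array([1.0, 0.0])
    g1, g2 = backprop_single_example(x, y, Theta1, Theta2)
    print(g1.shape, g2.shape)                       # (4, 3) (2, 4)
```

The Δ "accumulator" in the lecture simply sums these per-example gradients over the whole training set; dividing the sum by m gives the gradient of the average loss, which is exactly what the learning rule below consumes.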

The gradient is used to update the weights according to some learning rule, whose job is to reduce the value of the loss function. Many learning rules are possible, but one of the simplest and most widely used is gradient descent. The gradient at each point in parameter space is a vector that points in the direction in which the loss function increases most steeply. At each iteration, gradient descent takes a step in the direction opposite the gradient--that is, it steps in the direction of steepest descent, thereby reducing the loss function.
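Here is a toy sketch of that update rule on a simple quadratic loss (the target vector and learning rate are made up for illustration); when training a network, `grad_J` would instead be the gradient returned by backprop:

```python
import numpy as np

# Toy illustration of gradient descent on J(theta) = ||theta - target||^2.
target = np.array([3.0, -2.0])

def grad_J(theta):
    return 2.0 * (theta - target)        # gradient of the quadratic loss

theta = np.zeros(2)
alpha = 0.1                              # learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)  # step opposite the gradient (steepest descent)

print(theta)                             # converges toward [3.0, -2.0], the minimizer
```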