[Math] Why do we use gradient descent in the backpropagation algorithm

neural-networks, optimization

The common approach for training neural networks, as far as I know, is the backpropagation algorithm, which uses gradient descent to reduce the error.

(i) Why should one use a fixed learning rate / simulated annealing over, say, the Armijo rule?

(ii) Is there a good reason to use gradient descent in this case, or did it just develop historically? In particular, is gradient descent here favorable over algorithms like Newton's method (quasi-Newton, globalized Newton)?

Best Answer

The backpropagation algorithm IS gradient descent, and the reason it is usually restricted to the first derivative (instead of Newton's method, which requires the Hessian) is that applying the chain rule to the first derivative is what gives us the "back propagation" in the backpropagation algorithm. Newton's method is problematic (complex and expensive to compute), but that does not stop us from using quasi-Newton methods, especially BFGS (I believe many neural network software packages already use BFGS as part of their training these days).
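As a hedged illustration (not from the original answer): the NumPy sketch below computes gradients for a tiny one-hidden-layer network by applying the chain rule layer by layer (backpropagation), uses them first in plain gradient descent with a fixed learning rate, and then hands the same objective to a quasi-Newton solver (BFGS) via `scipy.optimize.minimize`. The toy data, network size, step size, and iteration count are arbitrary choices for illustration only.

```python
# Minimal sketch: backprop (chain rule) + gradient descent vs. BFGS (quasi-Newton).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                  # toy inputs (assumed data)
y = (X[:, :1] * X[:, 1:2] > 0).astype(float)  # toy targets

def unpack(w):
    W1 = w[:6].reshape(2, 3); b1 = w[6:9]
    W2 = w[9:12].reshape(3, 1); b2 = w[12:]
    return W1, b1, W2, b2

def loss_and_grad(w):
    W1, b1, W2, b2 = unpack(w)
    # forward pass
    z1 = X @ W1 + b1; a1 = np.tanh(z1)
    z2 = a1 @ W2 + b2; p = 1.0 / (1.0 + np.exp(-z2))
    loss = np.mean((p - y) ** 2)
    # backward pass: the chain rule applied layer by layer is the "back propagation"
    dz2 = 2 * (p - y) * p * (1 - p) / len(X)
    dW2 = a1.T @ dz2; db2 = dz2.sum(0)
    dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    grad = np.concatenate([dW1.ravel(), db1, dW2.ravel(), db2])
    return loss, grad

# plain gradient descent with a fixed learning rate
w = rng.normal(scale=0.1, size=13)
for _ in range(2000):
    loss, g = loss_and_grad(w)
    w -= 0.5 * g                              # fixed step size (learning rate)

# the same objective solved with a quasi-Newton method (BFGS)
res = minimize(loss_and_grad, rng.normal(scale=0.1, size=13),
               jac=True, method="BFGS")
print(loss, res.fun)
```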

As for the fixed learning rate, it need not be fixed at all. There are papers as far back as 1995 reporting on this (search for "adaptive learning rate backpropagation").
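As a sketch only: one classic adaptive-learning-rate heuristic (often called "bold driver") grows the step size while the loss keeps falling and shrinks it, undoing the step, when a step overshoots. The quadratic toy objective and the factors 1.05 / 0.5 below are arbitrary illustration choices, not taken from any particular paper.

```python
import numpy as np

# toy quadratic objective f(w) = ||A w - b||^2, just to demonstrate the schedule
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20)

def loss_and_grad(w):
    r = A @ w - b
    return r @ r, 2 * A.T @ r

lr, w = 0.01, np.zeros(5)
prev_loss, g = loss_and_grad(w)
for _ in range(200):
    w_trial = w - lr * g
    new_loss, g_trial = loss_and_grad(w_trial)
    if new_loss < prev_loss:          # step helped: accept it and grow the rate
        w, g, prev_loss = w_trial, g_trial, new_loss
        lr *= 1.05
    else:                             # step overshot: reject it and shrink the rate
        lr *= 0.5
print(prev_loss, lr)
```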
