When training a neural network with the back-propagation algorithm, gradient descent is used to determine the weight updates. My question is: rather than using gradient descent to slowly locate the minimum with respect to a given weight, why don't we just set the derivative $\frac{d(\text{Error})}{dw}=0$ and solve for the value of the weight $w$ that minimizes the error?
Also, why are we sure that the error function in back-propagation will reach a minimum? Couldn't it turn out that the stationary point is a maximum instead? Is there a specific property of the squashing functions that guarantees that a network with any number of hidden nodes, arbitrary weights, and arbitrary input vectors will always give an error function that has some minima?
Solved – Why use gradient descent with neural networks
backpropagation, gradient descent, neural networks
Best Answer
Because we can't. The optimization surface $S(\mathbf{w})$ as a function of the weights $\mathbf{w}$ is nonlinear: the weights enter through compositions of nonlinear activation functions, so the stationarity condition $\frac{d S(\mathbf{w})}{d\mathbf{w}}=0$ is a system of transcendental equations with no closed-form solution. We therefore have to locate a stationary point numerically, and gradient descent is the simplest way to do that.
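As a concrete illustration (my own sketch, not part of the original answer), consider a one-hidden-unit network $\hat{y} = v\,\tanh(wx)$ with squared error. The condition $\frac{dS}{dw}=0$ mixes $w$ with $\tanh(wx)$ and its derivative, so it cannot be solved in closed form; instead we iterate:

```python
# Minimal sketch: gradient descent on a tiny 1-hidden-unit network
# y_hat = v * tanh(w * x).  Even here dS/dw = 0 has no closed-form
# solution, so the weights are updated iteratively.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)   # inputs (arbitrary toy data)
y = np.sin(2.0 * x)                   # targets (arbitrary choice)

w, v = 0.5, 0.5                       # initial weights
lr = 0.1                              # learning rate

def loss(w, v):
    return 0.5 * np.mean((v * np.tanh(w * x) - y) ** 2)

for step in range(2000):
    h = np.tanh(w * x)                          # hidden activation
    err = v * h - y                             # residuals
    grad_v = np.mean(err * h)                   # dS/dv
    grad_w = np.mean(err * v * (1 - h**2) * x)  # dS/dw via the chain rule
    v -= lr * grad_v                            # step along the negative gradient
    w -= lr * grad_w

print(f"final loss: {loss(w, v):.4f}, w = {w:.3f}, v = {v:.3f}")
```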
Gradient descent, by definition, descends. If you reach a stationary point after descending, it has to be a (local) minimum or a saddle point, but never a local maximum.
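To see why descent never terminates at a maximum, here is a tiny one-dimensional sketch (my own example, not from the answer): on the surface $S(w)=\cos(w)$ the point $w=0$ is a local maximum, and gradient descent started anywhere near it moves downhill and settles at a minimum instead.

```python
# Sketch: gradient descent on S(w) = cos(w).  Started just off the
# local maximum at w = 0, the iterates move away from it and converge
# to the local minimum at w = pi.
import numpy as np

w = 0.1                    # start near the local maximum at w = 0
lr = 0.1                   # learning rate
for _ in range(200):
    grad = -np.sin(w)      # dS/dw for S(w) = cos(w)
    w -= lr * grad         # step downhill
print(w)                   # ~ 3.14159, a local minimum of cos(w)
```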