Solved – Why use gradient descent with neural networks

backpropagation · gradient descent · neural networks

  1. When training a neural network with the back-propagation algorithm, gradient descent is used to determine the weight updates. My question is: rather than using gradient descent to slowly locate the minimum with respect to a given weight, why don't we just set the derivative $\frac{d(\text{Error})}{dw}=0$ and solve for the value of the weight $w$ that minimizes the error? (The two alternatives are written out concretely after these questions.)

  2. Also, why are we sure that the error function used in back-propagation will have a minimum? Couldn't it turn out to be a maximum instead? Is there a specific property of the squashing functions that guarantees that a network with any number of hidden nodes, arbitrary weights, and arbitrary input vectors always yields an error function with at least one minimum?
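For concreteness, here is the squared-error objective together with the two alternatives the first question contrasts; the notation ($t_k$ for the targets, $y_k(\mathbf{w})$ for the network outputs, $\eta$ for the learning rate) is introduced here purely for illustration and is not from the original post.

$$E(\mathbf{w}) \;=\; \tfrac{1}{2}\sum_{k}\bigl(t_k - y_k(\mathbf{w})\bigr)^2,
\qquad
\underbrace{\,w \leftarrow w - \eta\,\frac{\partial E}{\partial w}\,}_{\text{gradient descent}}
\qquad\text{vs.}\qquad
\underbrace{\,\frac{\partial E}{\partial w} = 0 \ \text{ for every } w\,}_{\text{solve directly}}$$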

Best Answer

  1. Because we can't. The error surface $S(\mathbf{w})$ is a nonlinear function of the weights $\mathbf{w}$, so no closed-form solution exists for $\frac{d S(\mathbf{w})}{d\mathbf{w}}=0$: the outputs pass through the squashing functions, which makes the stationarity equations transcendental in $\mathbf{w}$. We therefore have to search for a solution iteratively (see the sketch after this answer).

  2. Gradient descent, by definition, descends. If you reach a stationary point after descending, it has to be a (local) minimum or a saddle point, but never a local maximum.
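As a minimal numerical sketch of both points, the following trains a tiny one-hidden-layer sigmoid network on XOR with plain batch gradient descent. The architecture (2-3-1), the learning rate, and the toy data are illustrative choices, not anything from the original answer; the point is only that the weights are found by repeated downhill steps rather than by solving $\frac{\partial E}{\partial w}=0$ in closed form, and that the printed error trends downward toward a (local) minimum.

```python
# Minimal sketch (illustrative, not the original poster's code): gradient descent
# on the squared-error surface of a tiny 2-3-1 sigmoid network trained on XOR.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, a problem with no linear solution.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])

# Random initial weights for a 2-3-1 network with sigmoid units.
W1 = rng.normal(scale=1.0, size=(2, 3))
b1 = np.zeros(3)
W2 = rng.normal(scale=1.0, size=(3, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 1.0  # learning rate (illustrative choice)
for step in range(5001):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)           # hidden activations
    y = sigmoid(h @ W2 + b2)           # network outputs
    E = 0.5 * np.sum((y - t) ** 2)     # squared-error loss

    # Backward pass (back-propagation): chain rule through the sigmoids.
    delta2 = (y - t) * y * (1 - y)             # dE/d(output pre-activation)
    delta1 = (delta2 @ W2.T) * h * (1 - h)     # dE/d(hidden pre-activation)

    # Gradient-descent updates: w <- w - eta * dE/dw.
    W2 -= eta * h.T @ delta2
    b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0)

    if step % 1000 == 0:
        # The error trends downward toward a (local) minimum, never toward a maximum.
        print(f"step {step:5d}  error {E:.4f}")
```

With a sufficiently small learning rate, each step along the negative gradient can only decrease (or leave unchanged) the error, which is why the procedure can end up at a local minimum or a saddle point but never at a maximum.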
