Why is optimisation solved with gradient descent rather than with an analytical solution?

calculus, machine learning, neural networks, optimization

I'm trying to understand why, when minimising an objective function, gradient descent is so often used rather than setting the gradient of the error to zero and solving the resulting equation analytically.

In school, I was taught that in order to minimise a function, you take its first derivative with respect to the parameters and set it to zero. Solving this equation then yields one or more solutions (candidate minima), and you can try each one to find which is the global minimum.

However, when learning about neural networks, I have seen that training is carried out using gradient descent (or back-propagation in the case of multi-layer networks). Here, the gradient of the error function is taken, but instead of setting it to zero, you just move the parameters a small amount in the direction opposite to this gradient.
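Concretely, the update rule I have in mind is (writing $\theta$ for the parameters, $E$ for the error function, and $\eta$ for the small step size; the symbols are just my own notation):

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} E(\theta)$$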

I don't understand why the analytical approach isn't used, given its obvious strength: the solution is exact and guaranteed to be the global minimum, whereas gradient descent may only find a local minimum. My intuitive explanation is that solving an equation in so many parameters by hand becomes very hard, but, perhaps naively, I would have thought there are automated ways to solve these kinds of complex equations these days. Wouldn't the effort of enabling this be worth it, given the better results, compared with the effort of getting gradient descent to work only to obtain inferior results?

Thank you 🙂

Best Answer

The system of equations you get by setting the derivatives equal to zero cannot generally be solved analytically. For instance, suppose I want to choose $0<x<10$ to minimize $x\ln(x)-\sqrt{x}$ (which does attain a minimum on $0<x<10$). The first-order condition is $\ln(x)+1-\frac{1}{2\sqrt{x}}=0$, which does not admit a closed-form solution. Of course, we can try to find a solution using numerical methods, but that is precisely what gradient descent does!
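For what it's worth, here is a minimal sketch of that numerical approach on this one-dimensional example, written in Python; the starting point and step size below are arbitrary choices of mine, not anything forced by the problem:

```python
import math

# Objective from above: f(x) = x*ln(x) - sqrt(x), considered on 0 < x < 10
def f(x):
    return x * math.log(x) - math.sqrt(x)

# Its derivative, i.e. the first-order condition: f'(x) = ln(x) + 1 - 1/(2*sqrt(x))
def grad(x):
    return math.log(x) + 1 - 1 / (2 * math.sqrt(x))

x = 1.0      # starting point (arbitrary, inside (0, 10))
eta = 0.1    # step size (arbitrary)
for _ in range(1000):
    x -= eta * grad(x)   # move a small amount against the gradient

print(x, f(x))  # x settles near the stationary point where f'(x) = 0 (roughly 0.68)
```

The loop never solves $\ln(x)+1-\frac{1}{2\sqrt{x}}=0$ symbolically; it just follows the negative gradient until the derivative is (numerically) zero, which is exactly the point being made above.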
