Solved – Why modern ML algorithms rarely use optimizers based on Newton's method

deep learning, gradient descent, machine learning, MATLAB

I am currently working through Andrew Ng's machine learning course on Coursera, and I wonder why modern deep learning algorithms do not use fminunc (assuming a MATLAB/Octave environment) to optimize weights.

To my knowledge, fminunc does not require setting a learning rate, so I think it would be more convenient for optimizing weights than gradient descent.

Are there any disadvantages to using fminunc instead of gradient-descent-based optimizers?
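For context, the course exercises call fminunc with a handle that returns both the cost and its gradient, roughly like this (a minimal sketch with a toy quadratic cost standing in for the course's logistic-regression cost; costFn, target and the option values are illustrative only):

% Minimal sketch (Octave/MATLAB): fminunc only needs a handle returning
% the cost and its gradient; there is no learning rate to tune by hand.
% Toy cost: J(theta) = ||theta - target||^2, with gradient 2*(theta - target).
target = [1; 2; 3];
costFn = @(theta) deal(sum((theta - target).^2), 2 * (theta - target));

options = optimset('GradObj', 'on', 'MaxIter', 400);  % we supply the gradient ourselves
initial_theta = zeros(3, 1);
[theta, cost] = fminunc(costFn, initial_theta, options);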

Best Answer

I assume that by fminunc you mean the function from MATLAB or Octave. I took the liberty of editing your question to add the corresponding tags. If I do this in Octave

>> help fminunc

among other things, I get this line

Function File: [X, FVAL, INFO, OUTPUT, GRAD, HESS] = fminunc (FCN,
     ...)

This doesn't tell me exactly which algorithm the function uses, but there is one alarming variable that screams that this function is not fit for training neural networks: the variable HESS.

So, this function computes the Hessian. This does not surprise me, since in your question you said that it does not need a learning rate. Minimization functions that do not need a learning rate need to be, simply put, aware of the curvature of the cost function. Well-known examples are Gauss-Newton and Levenberg-Marquardt, which work with an explicit (approximate) Hessian. On the other hand, you have other, lighter algorithms like L-BFGS that only approximate the Hessian. Either way, this is far too expensive in deep learning. Imagine your cost function is

$F:\mathbb{R}^n\rightarrow\mathbb{R}$

then its Hessian is of size $n \times n$. That is fine if your function has a reasonably small number of parameters. But in deep learning you can easily end up with millions of parameters, i.e. $n=10^6$, so the Hessian is far too big to compute! This should also answer your question about Newton's method, since its very essence is to use the Hessian.
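To spell out the cost: Newton's method updates the parameters as $\theta_{k+1} = \theta_k - H(\theta_k)^{-1}\nabla F(\theta_k)$, so every step needs the full $n \times n$ Hessian $H$ and the solution of a linear system in it. With $n = 10^6$, $H$ alone has $10^{12}$ entries, roughly 8 TB in double precision, before you even pay the (roughly cubic in $n$) cost of the solve.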

This is why algorithms like stochastic gradient descent are popular in machine learning. Plus, don't forget that even plain full-batch gradient descent is impractical in deep learning, because you have to approximate the gradient of the cost function using mini-batches.
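To make the contrast with fminunc concrete, here is a minimal mini-batch SGD sketch in Octave, with toy least-squares data only so that it runs; gradFn, the data and the hyperparameters are illustrative placeholders, not a recipe:

% Toy data and a least-squares gradient, just to make the sketch self-contained.
X = randn(1000, 5);  y = randn(1000, 1);
theta = zeros(5, 1);
gradFn = @(theta, Xb, yb) Xb' * (Xb * theta - yb) / size(Xb, 1);

alpha = 0.01;  batch_size = 64;  num_epochs = 10;   % illustrative hyperparameters
m = size(X, 1);
for epoch = 1:num_epochs
  idx = randperm(m);                                % reshuffle the examples each epoch
  for k = 1:batch_size:m
    b = idx(k:min(k + batch_size - 1, m));          % indices of one mini-batch
    grad = gradFn(theta, X(b, :), y(b));            % gradient estimated on the batch only
    theta = theta - alpha * grad;                   % plain SGD step with learning rate alpha
  end
end

Note the learning rate alpha: that is exactly the knob fminunc spares you, but in exchange SGD never has to touch a Hessian and only ever sees a small batch of data at a time.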

Also, one could argue that this line from the documentation

'fminunc' attempts to determine a vector X such that 'FCN (X)' is a
 local minimum.

goes against what you want in deep learning, since you seek the global minimum. However, as this paper seems to indicate (I might be wrong; I haven't read all the details), this is probably not a problem in practice.