Solved – Why is the optimal learning rate obtained from analyzing the gradient descent algorithm rarely (if ever) used in practice

gradient-descent, machine-learning

Why is the optimal learning rate obtained from analyzing the gradient descent algorithm rarely (if ever) used in practice?

The gradient descent procedure iteratively computes $a(k+1) = a(k) - \eta(k)\nabla J(a(k))$. Expanding $J(a(k+1))$ with a second-order Taylor expansion and setting the derivative with respect to $\eta$ to zero, one obtains the optimal learning rate $$\eta^{opt} = \frac{\|\nabla J\|^2}{\nabla J^T H \nabla J},$$ where $H$ is the Hessian (the matrix of second-order derivatives) of the cost function.
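
(For completeness, the expansion behind this expression is the standard one: $$J(a - \eta\nabla J) \approx J(a) - \eta\,\|\nabla J\|^2 + \tfrac{1}{2}\eta^2\,\nabla J^T H\,\nabla J,$$ and differentiating the right-hand side with respect to $\eta$ and setting it to zero yields the $\eta^{opt}$ above.)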

However, I have not seen this used in any learning algorithm that employs gradient descent, such as SVMs or the perceptron. Is there a reason for that? Or is it implicitly employed in a way I am not aware of? If so, can anyone illustrate the math involved?

Best Answer

It's not used because it's counterproductive.

Just about the only justification for using gradient descent (and it's really not a good justification at all, as you will see if you read through some of the posts on the topic on this site) is that it avoids calculating the Hessian, which can be very expensive for high-dimensional problems. So once you've calculated the Hessian, you've given up gradient descent's one strength: not needing the Hessian.

If you have calculated the Hessian anyway, you're better off using something like Newton's method.
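
To make the point concrete, here is a minimal sketch (not from the original answer) comparing one step of gradient descent with the "optimal" learning rate against one Newton step on a toy quadratic cost. All names (`H`, `b`, `a`) and the specific quadratic are illustrative assumptions; note that computing $\eta^{opt}$ already requires the Hessian, yet Newton's method uses that same Hessian to solve the quadratic exactly in a single step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Toy quadratic cost J(a) = 0.5 a^T H a - b^T a with a positive definite Hessian.
A = rng.standard_normal((dim, dim))
H = A @ A.T + dim * np.eye(dim)
b = rng.standard_normal(dim)
a = rng.standard_normal(dim)          # starting point

def grad(a):
    # Gradient of the quadratic cost: H a - b.
    return H @ a - b

g = grad(a)

# Gradient descent with the "optimal" learning rate from the question:
# eta_opt = ||g||^2 / (g^T H g). It already needs the Hessian.
eta_opt = (g @ g) / (g @ H @ g)
a_gd = a - eta_opt * g

# Newton's method: same Hessian, but it uses the full curvature
# information and reaches the minimizer of the quadratic in one step.
a_newton = a - np.linalg.solve(H, g)

a_star = np.linalg.solve(H, b)        # true minimizer
print("distance to optimum after one GD step:    ", np.linalg.norm(a_gd - a_star))
print("distance to optimum after one Newton step:", np.linalg.norm(a_newton - a_star))
```

On this example the Newton step lands (up to floating-point error) on the minimizer, while the optimally-stepped gradient descent step does not, which is the answer's point: once you pay for the Hessian, taking only a gradient step wastes it.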