Solved – Why don’t we use non-constant learning rates for gradient descent for things other than neural networks

deep learning, gradient descent, machine learning, optimization

Deep learning literature is full of clever tricks involving non-constant learning rates in gradient descent. Things like exponential decay, RMSprop, Adagrad, etc. are easy to implement and are available in every deep learning package, yet they seem to be nonexistent outside of neural networks. Is there any reason for this? If it is that people simply don't care, is there a reason why we don't have to care outside of neural networks?
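For concreteness, here is a minimal NumPy sketch contrasting a constant step size with an Adagrad-style per-parameter, non-constant learning rate. The least-squares objective and all constants are illustrative placeholders, not taken from any particular library:

```python
import numpy as np

# Toy objective (placeholder): f(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)
w = np.zeros(10)

def grad(w):
    return A.T @ (A @ w - b)

# Constant learning rate: same step size for every parameter, every iteration
#   w -= 0.01 * grad(w)
# Exponential decay would instead shrink one global step size over time:
#   eta_t = eta0 * decay ** t

# Adagrad-style non-constant learning rate: each parameter's step shrinks
# according to the squared gradients accumulated for that parameter so far
accum = np.zeros_like(w)
eta, eps = 0.1, 1e-8
for _ in range(500):
    g = grad(w)
    accum += g ** 2
    w -= eta / (np.sqrt(accum) + eps) * g
```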

Best Answer

Disclaimer: I don't have much experience with optimization outside of neural networks, so my answer will clearly be biased, but there are several things that play a role:

  • (Deep) neural networks have a lot of parameters. This has several implications:

    Firstly, it more or less rules out higher-order methods, simply because computing the Hessian and higher derivatives becomes infeasible (see the sketch after this list). In other domains, this may be a valid approach, better than any tweak to SGD.

    Secondly, although SGD is wonderful, it tends to be impractically slow. The improved SGD variants mainly enable faster training, while potentially losing some of the nice properties of SGD. In other domains, SGD training time may not be the bottleneck, so the improvements gained by speeding it up may simply be negligible.

  • Training (deep) neural networks is non-convex optimization, and I am not aware of significant convex relaxation results in the field. Unlike other fields, neural network research is not focused on provably globally optimal solutions, which leads to more effort being invested in improving the properties of the loss surface and its traversal during optimization.

    In other fields, employing convex relaxation and obtaining globally optimal solutions may be the center of interest rather than the optimization algorithm itself, because once the problem is formulated as a convex one, the choice of optimization algorithm cannot improve the quality of the solution.
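To illustrate the first point above, a second-order (Newton) step needs the full d×d Hessian, which is out of reach when d is in the millions, whereas the first-order updates that SGD variants tweak only ever touch the gradient. A rough sketch on a toy least-squares objective (purely illustrative, not any specific framework's API):

```python
import numpy as np

d = 10  # parameter count; for deep nets d is in the millions, making the d x d Hessian below infeasible
rng = np.random.default_rng(0)
A = rng.standard_normal((50, d))
b = rng.standard_normal(50)
w = np.zeros(d)

def gradient(w):
    return A.T @ (A @ w - b)   # O(d) memory

def hessian(w):
    return A.T @ A             # O(d^2) memory, roughly O(d^3) to solve against

# Newton step: uses exact curvature, but cost scales cubically with d
w_newton = w - np.linalg.solve(hessian(w), gradient(w))

# First-order step (what SGD and its adaptive variants build on): gradient only
w_first_order = w - 0.01 * gradient(w)
```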

I suppose this answer does not cover all possible aspects and I am myself curious about other opinions.