Solved – Adam (adaptive) optimizer(s) learning rate tuning

adam · machine learning · optimization · stochastic gradient descent

I'm reading Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, and on page 325 (continuing on page 326) there's the following piece of text on the learning rate:

The learning rate is arguably the most important hyperparameter. In general, the
optimal learning rate is about half of the maximum learning rate (i.e.
the learning rate above which the training algorithm diverges, as we
saw in Chapter 4). One way to find a good learning rate is to train
the model for a few hundred iterations, starting with a very low
learning rate (e.g., 1e-5) and gradually increasing it up to a very
large value (e.g., 10). This is done by multiplying the learning rate
by a constant factor at each iteration (e.g., by exp(log(1e6)/500) to go
from 1e-5 to 10 in 500 iterations). If you plot the loss as a function
of the learning rate (using a log scale for the learning rate), you should
see it dropping at first. But after a while, the learning rate will be
too large, so the loss will shoot back up: the optimal learning rate
will be a bit lower than the point at which the loss starts to climb
(typically about 10 times lower than the turning point). You can then
reinitialize your model and train it normally using this good learning
rate. (…)
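
For concreteness, the sweep the book describes could be sketched with a small Keras callback along these lines (the model, data, and callback below are illustrative placeholders of my own, not the book's exact code):

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    """After each batch, record (learning rate, loss) and multiply the
    learning rate by a constant factor."""
    def __init__(self, init_lr, factor):
        super().__init__()
        self.lr = init_lr
        self.factor = factor
        self.rates, self.losses = [], []

    def on_train_batch_end(self, batch, logs=None):
        self.rates.append(self.lr)
        self.losses.append(logs["loss"])
        self.lr *= self.factor
        self.model.optimizer.learning_rate.assign(self.lr)

# Placeholder model and data -- substitute your own.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1),
])
X_train = np.random.rand(1000, 10).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

# Go from 1e-5 to 10 in 500 iterations: multiply by exp(log(1e6)/500) per batch.
init_lr, final_lr, n_iter = 1e-5, 10.0, 500
factor = np.exp(np.log(final_lr / init_lr) / n_iter)

model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=init_lr))
exp_lr = ExponentialLearningRate(init_lr, factor)
model.fit(X_train, y_train, epochs=1, batch_size=2, callbacks=[exp_lr])  # 500 batches

# Plot loss vs. learning rate on a log scale; pick a rate roughly 10x lower
# than the point where the loss starts to shoot back up.
plt.plot(exp_lr.rates, exp_lr.losses)
plt.xscale("log")
plt.xlabel("Learning rate")
plt.ylabel("Loss")
plt.show()
```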

My question is: does this advice apply to optimizers in general, or to SGD in particular?

Best Answer

Adam is an adaptive algorithm, so it self-tunes during training. In many cases you can get away with the default hyperparameters without any tuning. As you can learn from this thread, tuning the learning rate can sometimes lead to improvements, but the range of known good values is narrower than for other algorithms, so it should usually not be your first concern. Also note that for the $\beta_1$ and $\beta_2$ hyperparameters the general advice is not to change the defaults; do so only when you have a good reason.
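
For example, in tf.keras the stock Adam constructor already ships with the commonly recommended defaults; a minimal sketch (the 3e-4 value is only an illustrative alternative, not a recommendation for your problem):

```python
import tensorflow as tf

# Keras defaults for Adam: learning_rate=0.001, beta_1=0.9,
# beta_2=0.999, epsilon=1e-7 -- usually a reasonable starting point.
optimizer = tf.keras.optimizers.Adam()

# If you do tune anything, the learning rate is the usual knob;
# the value below is purely illustrative.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)

# Leave beta_1 and beta_2 at their defaults unless you have a
# concrete reason to change them.
```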