Solved – Is manually tuning the learning rate during training redundant with optimization methods like Adam?

adam, machine-learning, neural-networks, optimization

I have seen some high-profile deep learning papers where an optimization method like Adam was used, yet the learning rate was manually changed at specific iterations.

What is the relationship between the adaptivity provided by adaptive optimization methods and manually tuning the learning rate? Would it still make sense with Adam to, for example, lower the learning rate after not seeing improvement for a number of iterations?

Best Answer

Yes, it is good practice to tune the learning rate even with Adam.

Most variants of SGD which claim to be "adaptive", including optimizers like Adagrad and Adam, adjust the relative learning rates of parameters.

The update rules for many of these adaptive SGD variants involve a step very similar to the following:

$$\Delta w = -\frac{\eta}{r} \frac{\partial f}{\partial w}$$

where $\eta$ is the global learning rate and $r$ is usually some sort of accumulating average of $\left| \frac{\partial f}{\partial w} \right|$ over time. (This is a very rough sketch of how it works!)

Since $r$ is maintained separately for each parameter and tracks that parameter's gradient magnitude, dividing by it normalizes the step size: a parameter with a really large gradient and one with a very small gradient end up receiving updates of roughly the same size.
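As a rough illustration of this normalization, here is a minimal sketch in Python/NumPy that uses an RMSProp-style running average of squared gradients as a stand-in for $r$ (it is not Adam's exact rule, and the learning rate, decay factor, and gradient values are made up for the example). Two parameters whose gradients differ by five orders of magnitude still end up with updates of about the same size:

```python
import numpy as np

lr = 0.01        # global learning rate (eta)
eps = 1e-8       # avoids division by zero
w = np.zeros(2)  # two parameters
r = np.zeros(2)  # running average of squared gradients, one entry per parameter

for step in range(100):
    grads = np.array([100.0, 0.001])          # wildly different gradient magnitudes
    r = 0.9 * r + 0.1 * grads ** 2            # accumulate per-parameter gradient scale
    update = lr * grads / (np.sqrt(r) + eps)  # divide by sqrt(r): per-parameter scaling
    w -= update

print(update)  # both entries are ~0.01 -- the two updates end up the same size
```

After a few steps $\sqrt{r}$ approaches each parameter's gradient magnitude, so the division cancels the difference in scale and both updates settle near $\eta$.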

However, the same normalization works against you late in training: if the gradients eventually become small (for example, near a minimum), $r$ shrinks along with them, so $1/r$ becomes large and $\Delta w$ stays roughly the same size instead of decaying. Therefore, it is still necessary to lower the global learning rate over the course of training in order to achieve good results.
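One common way to do this in practice is to keep Adam but wrap it in a schedule that lowers the learning rate when the validation loss stops improving, which is exactly the strategy the question asks about. This is a minimal sketch assuming PyTorch; the placeholder model, the patience of 5 epochs, and the dummy validation loss are illustrative, not anything prescribed by the answer:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam still takes a global lr

# Cut the learning rate by 10x after 5 epochs without improvement in validation loss.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

for epoch in range(50):
    # ... train for one epoch, then compute the real validation loss here ...
    val_loss = torch.rand(1).item()  # stand-in for an actual validation loss
    scheduler.step(val_loss)         # lowers Adam's global lr when val_loss plateaus
```

Here the scheduler multiplies the optimizer's learning rate by `factor` whenever the monitored metric has not improved for `patience` epochs, so Adam's per-parameter adaptivity and a global learning-rate decay operate side by side.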

To summarize, the adaptivity provided by algorithms like Adam adjusts the relative learning rates of different parameters; it does not decrease the learning rate over time, so it is a different kind of adaptivity.

There are, however, other algorithms, such as YellowFin, which claim not to need any hyperparameter tuning at all: https://arxiv.org/abs/1706.03471