Solved – why the Adam optimizer is considered robust to the value of its hyperparameters

adam, deep-learning, hyperparameter, neural-networks, optimization

I was reading about the Adam optimizer for deep learning and came across the following sentence in the book Deep Learning by Goodfellow, Bengio, and Courville:

Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.

If this is true, it's a big deal, because hyperparameter search can be really important (in my experience, at least) to the statistical performance of a deep learning system. Thus, my question is: why is Adam robust to such important parameters? Especially $\beta_1$ and $\beta_2$?
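
To fix notation for the question, here is a minimal sketch of the Adam update (following Algorithm 1 of the paper and its suggested defaults), just to show where $\beta_1$, $\beta_2$, and the learning rate $\alpha$ enter. The NumPy function below is my own illustration, not code from the paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)              # bias correction for zero-initialized v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```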

I've read the Adam paper, and it doesn't provide any explanation of why it works well with those parameters or why it's robust. Do they justify that elsewhere?

Also, as I read the paper, the number of hyperparameter values they tried was very small: only two for $\beta_1$ and only three for $\beta_2$. How can this be a thorough empirical study if it covers only a 2×3 grid of hyperparameter settings?

Best Answer

Regarding evidence for the claim, I believe the only support can be found in Figure 4 of their paper, which shows final results under a range of different values for $\beta_1$, $\beta_2$, and $\alpha$.

Personally, I don't find their argument convincing, in particular because they do not present results across a variety of problems. That said, I have used Adam on a variety of problems myself, and my personal finding is that the default values of $\beta_1$ and $\beta_2$ do seem surprisingly reliable, although a good deal of fiddling with $\alpha$ is required.
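
To make that workflow concrete, here is a hypothetical toy sweep that keeps the paper's default $\beta_1 = 0.9$, $\beta_2 = 0.999$ fixed and varies only the learning rate $\alpha$. The regression problem, the learning-rate grid, and the use of PyTorch's built-in torch.optim.Adam are my own choices for illustration:

```python
import torch

# Toy regression problem: recover w_true from noisy linear data.
torch.manual_seed(0)
X = torch.randn(256, 10)
w_true = torch.randn(10, 1)
y = X @ w_true + 0.1 * torch.randn(256, 1)

# Keep the default betas fixed and sweep only the learning rate alpha.
for lr in (1e-4, 1e-3, 1e-2, 1e-1):
    w = torch.zeros(10, 1, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr, betas=(0.9, 0.999))
    for _ in range(500):
        opt.zero_grad()
        loss = torch.mean((X @ w - y) ** 2)
        loss.backward()
        opt.step()
    print(f"lr={lr:g}  final MSE={loss.item():.4f}")
```

The point of the sketch is that only $\alpha$ varies across runs; the betas stay at their paper defaults throughout, which is the one-dimensional tuning described above.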