Why does the Adam optimizer seem to prevail over the Nadam optimizer?

adam, gradient-descent, neural-networks, optimization

I have been studying how the Adam optimizer works and how it combines the RMSProp and Momentum optimizers.

So the following question arises: why not combine Nesterov Accelerated Gradient (NAG) with RMSProp? Wouldn't that yield better results? A quick search shows that such an optimizer indeed exists: Nadam does precisely that, combining NAG with RMSProp, and it appears to yield better results, as can be seen in the following links:
[Paper]
[Report]
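
For reference, here is a rough sketch of the two update rules as they are usually written (my notation; this simplifies away the momentum-decay schedule used in the Nadam paper and in the Keras implementation), where $g_t$ is the gradient, $m_t$ and $v_t$ are the first- and second-moment estimates, and $\hat{m}_t$, $\hat{v}_t$ their bias-corrected versions:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2,\\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t},\\
\text{Adam:}\quad \theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\,\hat{m}_t,\\
\text{Nadam:}\quad \theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\left(\beta_1 \hat{m}_t + \frac{(1-\beta_1)\,g_t}{1-\beta_1^t}\right).
\end{aligned}
$$

The only difference is the Nesterov-style look-ahead in the numerator of the Nadam step.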

I have even seen that Keras already has an implementation of Nadam: Keras-Nadam
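
For example (a minimal sketch, assuming TensorFlow/Keras 2.x; the toy model is just an illustration), Nadam is a drop-in replacement for Adam when compiling a model:

```python
import tensorflow as tf

# Toy model for illustration only; Nadam is used exactly like Adam.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-3), loss="mse")
```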

My question is: why does the deep learning community still prefer the Adam optimizer? Why is Adam the most established optimizer when, in my opinion, Nadam makes more sense?

If Nesterov momentum was shown to be an improvement over plain Momentum, why not use Nadam?

Thanks in advance!

Best Answer

There is no single optimizer that beats all the others. If you look at published papers, you will see different optimizers used. People often still use stochastic gradient descent; you can find a nice discussion of this on Quora. There are also results suggesting that basic SGD may generalize better (Hardt, Recht, & Singer, 2016). So the deep learning community does not uniformly prefer Adam (though it is popular, given that it has proven to give decent results), nor is Nadam guaranteed to outperform Adam or other optimizers.
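
In practice, the only reliable way to find out which optimizer works for your problem is to try a few and compare. A minimal sketch (assuming TensorFlow/Keras 2.x; the toy data, architecture, and hyperparameters here are my own illustrative choices, not a benchmark):

```python
import numpy as np
import tensorflow as tf

# Toy regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X @ rng.normal(size=(20, 1)) + 0.1 * rng.normal(size=(1000, 1))).astype("float32")

optimizers = {
    "sgd":   tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9),
    "adam":  tf.keras.optimizers.Adam(learning_rate=1e-3),
    "nadam": tf.keras.optimizers.Nadam(learning_rate=1e-3),
}

for name, opt in optimizers.items():
    # Same architecture for every optimizer, so only the update rule differs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=opt, loss="mse")
    history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
    print(name, "final val loss:", history.history["val_loss"][-1])
```

On one task Nadam may edge out Adam, on another SGD with momentum may generalize best; that is exactly the point of the answer above.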