Solved – How to choose between SGD with Nesterov momentum and Adam

adam, nesterov, neural networks, stochastic gradient descent

I'm currently implementing a neural network architecture in Keras. I would like to reduce training time, and I'm considering alternative optimizers such as SGD with Nesterov momentum and Adam.

I've read several things about the pros and cons of each method (e.g., SGD with Nesterov momentum is very sensitive to the initial value of the learning rate and requires learning rate scheduling, though I'm not sure what that means), but I still don't know how they compare or how to choose between them.

Can someone help me clarify these points? Thanks!

Best Answer

In general, there aren't definitive results showing that one learning algorithm is "better" than another. The common wisdom (which needs to be taken with a pound of salt) has been that Adam requires less experimentation to get convergence on the first try than SGD and its variants. However, this is highly dataset- and model-dependent. More recently, some groups have claimed that, despite training faster, models trained with Adam generalize worse to the validation or test sets. See https://arxiv.org/abs/1712.07628 for more information.
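In practice, the cheapest way to decide is to try both on your own data and compare training and validation curves. Below is a minimal sketch assuming a recent tf.keras; the learning rates, decay schedule, and toy model are illustrative placeholders, not recommendations.

```python
import tensorflow as tf

# Placeholder model; swap in your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Option 1: SGD with Nesterov momentum. "Learning rate scheduling" means the
# step size is decayed over training rather than held fixed; ExponentialDecay
# is one built-in schedule (the values here are arbitrary examples).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=10_000, decay_rate=0.96)
sgd = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9, nesterov=True)

# Option 2: Adam, which adapts per-parameter step sizes and often works
# reasonably well with its default learning rate.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)

# Compile with either optimizer; everything else stays the same, so switching
# between the two is just a matter of changing the optimizer= argument.
model.compile(optimizer=sgd, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Since only the `optimizer=` argument changes, running the same training script once with each optimizer and comparing validation performance is usually the most reliable way to choose for your particular dataset and model.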
