Solved – What should we do when changing the SGD optimizer to the Adam optimizer?

adam, conv-neural-network, neural-networks, optimization, stochastic-gradient-descent

Adam is a popular optimization method with an adaptive learning rate. I'm working on an image segmentation project using fully convolutional networks. All weights were initialized from truncated normal distributions. Initially, I used the Adam optimizer and got convergence of the loss on both the training and test sets, with reasonable accuracy (around 0.8). But when I switched to the SGD optimizer, the loss seems to converge, yet the accuracy is nearly zero. So my question is: when we adopt a different optimizer, what do we need to change for successful network training? The weight initialization?
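
For concreteness, here is a minimal PyTorch-style sketch of the kind of swap being described; the model, layer sizes, and learning rates are hypothetical placeholders, not details from the original setup.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a fully convolutional segmentation network.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 2, kernel_size=1),  # 2-class segmentation logits
    )

    # Truncated-normal initialization of all conv weights, as described.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.trunc_normal_(m.weight, mean=0.0, std=0.02)
            nn.init.zeros_(m.bias)

    # Original setup: Adam converges to reasonable accuracy.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Naive swap: reusing Adam's hyperparameters with SGD often fails,
    # matching the symptom described (loss "converges", accuracy near zero).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)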

Best Answer

In my experience, changing optimizers is not a simple matter of swapping one for the other; the choice of optimizer interacts with several other configuration choices in the network (see the sketch after the list below for a concrete illustration).

  • The optimizer interacts with the initialization scheme, so this might need to be changed.

  • The learning rate may need to be changed.

  • The learning rate schedule may need to be adjusted.

  • In some cases, SGD with momentum can be a big improvement over Adam. See: "The Marginal Value of Adaptive Gradient Methods in Machine Learning" by Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht

    Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
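
To make these points concrete, here is a minimal PyTorch-style sketch of the kinds of adjustments involved when moving from Adam to SGD with momentum. The initialization scheme, learning rate, momentum, weight decay, and schedule values are assumptions chosen for illustration, not prescriptions from the paper; the right values have to be found by tuning on the task at hand.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for the segmentation network.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 2, kernel_size=1),
    )

    # 1. Initialization: a change of optimizer may call for a different scheme;
    #    He/Kaiming initialization is a common choice for ReLU networks trained
    #    with SGD (an assumption here, not a rule from the answer).
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)

    # 2./3. Learning rate and schedule: SGD typically needs a larger initial
    #       learning rate than Adam, plus explicit decay during training.
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,           # much larger than a typical Adam lr of 1e-3
        momentum=0.9,     # 4. SGD with momentum, as discussed by Wilson et al.
        weight_decay=1e-4,
    )
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    # Inside the training loop, step the scheduler once per epoch:
    # for epoch in range(num_epochs):
    #     train_one_epoch(model, optimizer)   # hypothetical helper
    #     scheduler.step()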