What should we do when changing between the SGD and Adam optimizers?
Tags: adam, conv-neural-network, neural networks, optimization, stochastic gradient descent
Adam is a popular optimization method with an adaptive learning rate. I'm working on an image segmentation project using fully convolutional networks, with all weights initialized from truncated normal distributions. Initially I used the Adam optimizer, and the loss converged on both the training and test sets with reasonable accuracy (around 0.8). But when I switched to the SGD optimizer, the loss appeared to converge while the accuracy stayed near zero. So my question is: when we adopt a different optimizer, what else do we need to change for the network to train successfully? The weight initialization?
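For concreteness, here is a minimal Keras-style sketch of the kind of swap described above; the toy model, learning rates, and initializer scale are illustrative assumptions, not the actual project code.

```python
import tensorflow as tf

# Truncated-normal weight initialization, as described in the question
# (the stddev here is an assumed value).
init = tf.keras.initializers.TruncatedNormal(mean=0.0, stddev=0.02)

def build_fcn(num_classes=2):
    # Tiny stand-in for a fully convolutional segmentation network.
    inputs = tf.keras.Input(shape=(None, None, 3))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               kernel_initializer=init)(inputs)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               kernel_initializer=init)(x)
    outputs = tf.keras.layers.Conv2D(num_classes, 1, activation="softmax",
                                     kernel_initializer=init)(x)
    return tf.keras.Model(inputs, outputs)

model = build_fcn()

# Setup that trained well: Adam at its usual default step size.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The swap in question: same architecture and initialization, only the
# optimizer replaced. Plain SGD at a similar learning rate often behaves
# very differently on the same network.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```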
Best Answer
In my experience, changing optimizers is not a simple matter of swapping one for the other; the choice of optimizer interacts with several other configuration choices in the network:
- The optimizer interacts with the initialization scheme, so the initialization might need to be changed.
- The learning rate may need to be changed.
- The learning rate schedule may need to be adjusted (a combined sketch of these adjustments follows this list).
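A combined sketch of those three adjustments, assuming a Keras-style setup; the specific learning rates, decay schedule, and initializer below are illustrative starting points rather than recommended values.

```python
import tensorflow as tf

# Adam is commonly run around 1e-3; plain SGD on a conv net usually needs
# its own base learning rate, often paired with momentum, to make
# comparable progress.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)

# When moving to SGD, the learning rate and its schedule typically have to
# be re-tuned rather than carried over from the Adam run.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-2,  # assumed starting point; tune per problem
    decay_steps=10_000,
    decay_rate=0.9)
sgd = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# If the initialization scale was chosen with Adam in mind, it may also
# need revisiting; e.g. He-style initialization for ReLU networks.
he_init = tf.keras.initializers.HeNormal()
```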
In some cases, SGD with momentum can be a big improvement over Adam. See "The Marginal Value of Adaptive Gradient Methods in Machine Learning" by Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht (2017).