Solved – What should we do when changing the SGD optimizer to the Adam optimizer?

adam, conv-neural-network, neural-networks, optimization, stochastic-gradient-descent

Adam is a popular optimization method with an adaptive learning rate. I'm working on an image segmentation project using fully convolutional networks. All weights were initialized from truncated normal distributions. Initially, I used the Adam optimizer and got convergence of the loss on both the training and test sets, with reasonable accuracy (around 0.8). But when I switched to the SGD optimizer, the loss seems to converge, yet the accuracy is nearly zero. So my question is: when we adopt a different optimizer, what do we need to change for successful network training? The weight initialization?
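
For concreteness, here is a minimal PyTorch-style sketch of the kind of swap being described; the model, layer sizes, and learning rates are hypothetical placeholders, not details from the original setup.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a fully convolutional segmentation network.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 2, kernel_size=1),  # 2-class segmentation logits
    )

    # Truncated-normal initialization of all conv weights, as described.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.trunc_normal_(m.weight, mean=0.0, std=0.02)
            nn.init.zeros_(m.bias)

    # Original setup: Adam converges to reasonable accuracy.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Naive swap: reusing Adam's hyperparameters with SGD often fails,
    # matching the symptom described (loss "converges", accuracy near zero).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)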

Best Answer

In my experience, changing optimizers is not a simple matter of swapping one for the other; the choice of optimizer interacts with several other configuration choices in the network (see the sketch after the list below for a concrete illustration).

  • The optimizer interacts with the initialization scheme, so this might need to be changed.

  • The learning rate may need to be changed.

  • The learning rate schedule may need to be adjusted.

  • In some cases, SGD with momentum can be a big improvement over Adam. See: "The Marginal Value of Adaptive Gradient Methods in Machine Learning" by Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht

    Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
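
To make these points concrete, here is a minimal PyTorch-style sketch of the kinds of adjustments involved when moving from Adam to SGD with momentum. The initialization scheme, learning rate, momentum, weight decay, and schedule values are assumptions chosen for illustration, not prescriptions from the paper; the right values have to be found by tuning on the task at hand.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for the segmentation network.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 2, kernel_size=1),
    )

    # 1. Initialization: a change of optimizer may call for a different scheme;
    #    He/Kaiming initialization is a common choice for ReLU networks trained
    #    with SGD (an assumption here, not a rule from the answer).
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)

    # 2./3. Learning rate and schedule: SGD typically needs a larger initial
    #       learning rate than Adam, plus explicit decay during training.
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,           # much larger than a typical Adam lr of 1e-3
        momentum=0.9,     # 4. SGD with momentum, as discussed by Wilson et al.
        weight_decay=1e-4,
    )
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    # Inside the training loop, step the scheduler once per epoch:
    # for epoch in range(num_epochs):
    #     train_one_epoch(model, optimizer)   # hypothetical helper
    #     scheduler.step()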