Solved – No change in accuracy using Adam Optimizer when SGD works fine

Tags: adam, conv-neural-network, neural-networks, optimization, stochastic-gradient-descent

I have been training a Spatial Transformer network with a DNN on the GTSRB dataset. I initially used SGD with momentum and was able to achieve good accuracy.

For further improvements and testing, I decided to change the optimizer to Adam, but strangely I am not seeing any increase in training or validation accuracy after quite a few epochs.

Is it possible that Adam is not well suited for this dataset? (Or, in general, can an optimizer fail on one dataset but work fine on others?)

Edit: I tried running Adam on a smaller dataset with a lower learning rate (initially 0.01, now set to 0.001). On the smaller dataset it showed some increase in accuracy, but on the bigger dataset the issue still persists.

Edit 2: Further decreasing the learning rate to 0.0001 makes even Adam work on the larger dataset.
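
For reference, a minimal sketch of the optimizer change described above, assuming PyTorch; the tiny placeholder model merely stands in for the actual Spatial Transformer network (the input size is an assumption), and the learning rates mirror the values from the question.

    # Minimal sketch (PyTorch assumed).
    import torch
    import torch.nn as nn

    # Placeholder model: GTSRB has 43 classes; the 3x32x32 input is an assumption.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 43))

    # Original setup: SGD with momentum at lr=0.01 trained fine.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Adam stalled at lr=0.01 and lr=0.001; only lr=1e-4 made progress.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)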

Thanks

Best Answer

The benefits of Adam can be marginal at best. Its initial published results were strong, but there is evidence that Adam converges to dramatically different minima than SGD (or SGD with momentum).

"The Marginal Value of Adaptive Gradient Methods in Machine Learning" Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple over-parameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

Speaking from personal experience, Adam can struggle unless you set a small learning rate, which sort of defeats the purpose of using an adaptive method in the first place, not to mention all of the time wasted tuning the learning rate.
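
If you do end up tuning the learning rate anyway, a cheap way to limit the damage is a short sweep over a few values for both optimizers before committing to long runs. A minimal sketch, assuming PyTorch; build_model and quick_eval are hypothetical placeholders for your model constructor and a few-epoch train/validate routine.

    # Minimal sketch of a short learning-rate sweep (PyTorch assumed).
    import torch

    def sweep(build_model, quick_eval, lrs=(1e-2, 1e-3, 1e-4)):
        results = {}
        for lr in lrs:
            for name, opt_cls, kwargs in [
                ("sgd+momentum", torch.optim.SGD, {"momentum": 0.9}),
                ("adam", torch.optim.Adam, {}),
            ]:
                model = build_model()
                optimizer = opt_cls(model.parameters(), lr=lr, **kwargs)
                # Record validation accuracy after a handful of epochs.
                results[(name, lr)] = quick_eval(model, optimizer)
        return results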
