I have been training a spatial transformer network with a DNN on the GTSRB dataset. I initially used SGD with momentum and was able to achieve good accuracy.
For further improvements and testing, I decided to change the optimizer to Adam, but strangely I am not seeing any increase in training or validation accuracy, even after quite a few epochs.
Is it possible that Adam is not well suited to this dataset? (Or, more generally, is it possible for an optimizer to fail on one dataset but work fine on others?)
Edit: I tried running Adam on a smaller dataset with a lower learning rate (initially it was 0.01, now I have set it to 0.001). On the smaller dataset it showed some increase in accuracy, but on the bigger dataset the issue still persists.
Edit 2: Further decreasing the learning rate to 0.0001 makes Adam work even on the larger dataset.
Thanks
Best Answer
The benefits of Adam can be marginal, at best. The initial results reported for Adam were strong, but there is evidence that Adam converges to dramatically different minima than SGD (or SGD + momentum).
"The Marginal Value of Adaptive Gradient Methods in Machine Learning" Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht
Speaking from personal experience, Adam can struggle unless you set a small learning rate, which sort of defeats the whole purpose of using an adaptive method in the first place, not to mention all of the time wasted toying with the learning rate.
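
One possible reason the same learning rate behaves so differently under the two optimizers: Adam normalizes each update by a running estimate of the gradient magnitude, so its per-parameter step is roughly the learning rate itself, whereas a plain SGD step is the learning rate times the gradient. The sketch below is a toy, single-parameter illustration of the standard Adam update rule with its usual default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8). It is not your network or any framework's implementation, the function names are made up for this example, and the SGD step omits momentum for simplicity.

```python
import math


def adam_step(grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    Returns (delta, m, v), where delta is the change applied to the parameter.
    Hypothetical helper written for illustration only.
    """
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v


def sgd_step(grad, lr):
    """One plain SGD update (momentum left out to keep the comparison simple)."""
    return -lr * grad


def first_step_sizes(grads, lr):
    """Compare the size of the very first update for several gradient magnitudes."""
    rows = []
    for g in grads:
        adam_delta, _, _ = adam_step(g, m=0.0, v=0.0, t=1, lr=lr)
        rows.append((g, abs(sgd_step(g, lr)), abs(adam_delta)))
    return rows


if __name__ == "__main__":
    for lr in (0.01, 0.001, 0.0001):
        print(f"lr = {lr}")
        for g, sgd_size, adam_size in first_step_sizes([1e-4, 1e-2, 1.0], lr):
            print(f"  |grad| = {g:g}: SGD step = {sgd_size:.2e}, Adam step = {adam_size:.2e}")
```

In this toy setting the first Adam step stays close to the learning rate whether the gradient is 1e-4 or 1.0, while the SGD step shrinks with the gradient. So if the gradients in your network are small, a learning rate of 0.01 that was fine for SGD could translate into much larger effective steps under Adam, which would be consistent with what you saw after dropping it to 0.0001. This is only a plausible explanation, not a diagnosis of your particular run.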