For straight SGD, you have
$$\theta_{t+1} = \theta_t - \eta g_t$$
and $\eta$ is the "learning rate" (a.k.a. step size).
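This update is easy to state in code. A minimal sketch of one SGD step on a toy quadratic objective (the objective and all names here are illustrative, not from any library):

```python
import numpy as np

# Toy objective f(theta) = ||theta||^2 / 2, so the gradient is g = theta.
eta = 0.1  # learning rate (step size)

theta = np.array([1.0, -2.0])
g = theta                      # gradient of the toy objective at theta
theta_next = theta - eta * g   # SGD update: theta_{t+1} = theta_t - eta * g_t
```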

In principle, any gradient descent method, including SGD, is trying to find a stationary point of the objective function $f[\theta]$, where $g=\partial_{\theta}f$ (and hopefully a local minimum, rather than a maximum or saddle). It is easy to see that
$$g_t=0 \implies \theta_{t+1} = \theta_t$$
so the SGD update is consistent with this (as is the RMSprop update you cite).

When you give the regularized update equation
$$\theta_{t+1} = \theta_t - \eta (g_t + 2\lambda\theta_t)$$
notice how
$$\theta_{t+1} = \theta_t \implies g_t + 2\lambda\theta_t = 0$$
So the stationary point no longer corresponds to $g=0$, but rather corresponds to
$$\hat{g} = g + 2\lambda\theta = \partial_{\theta}(f+\lambda\theta^2) = 0$$
i.e. the gradient of the regularized objective function, which includes a penalty term. (This also applies to the first "regularized RMSprop update" formula you give.)

Your "adjusted RMSprop" update equation
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t+2\eta\lambda\theta_t$$
corresponds to a stationary point
$$\theta_{t+1} = \theta_t \implies g_t - 2\left(\lambda\sqrt{E[g^2]_t+\epsilon}\right)\theta_t = 0$$
This shows that the update does not correspond to any consistent objective function. Rather, it corresponds to an "evolving objective function" where the effective regularization weight $\hat{\lambda}$ changes through time, and depends on the *path* the optimization takes, i.e. $E[g^2]$. (Note: It appears you have a sign change in the last formula ... did you mean to have a $-\lambda$ perhaps?)
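To make the path dependence concrete, here is a sketch (hypothetical gradient histories, RMSprop-style decay rate `rho` assumed) showing that the effective penalty weight $\hat{\lambda} = \lambda\sqrt{E[g^2]_t+\epsilon}$ differs for two optimization paths even when the current gradient is the same:

```python
import numpy as np

lam, eps, rho = 0.01, 1e-8, 0.9

def running_avg(grads, rho=rho):
    """RMSprop-style exponential moving average of squared gradients."""
    Eg2 = 0.0
    for g in grads:
        Eg2 = rho * Eg2 + (1 - rho) * g**2
    return Eg2

# Two gradient histories ending with the same current gradient...
path_a = [0.1, 0.1, 0.1]
path_b = [5.0, 3.0, 0.1]

lam_hat_a = lam * np.sqrt(running_avg(path_a) + eps)
lam_hat_b = lam * np.sqrt(running_avg(path_b) + eps)

# ...yield different effective regularization weights.
assert lam_hat_b > lam_hat_a
```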

Most "momentum" techniques will try to preserve the stationary points of the objective function (which may include penalty terms). For your question 2, I would say the standard approach is simply to add the penalty term to the objective function, so that it shows up in the gradient $g$ automatically (and then RMSprop, or whatever method, will incorporate it into $E[g^2]$).
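A sketch of that standard approach, again on a toy quadratic objective: the penalized gradient $\hat{g} = g + 2\lambda\theta$ flows through RMSprop's accumulator $E[g^2]$ like any other gradient, so the stationary points of the penalized objective are preserved.

```python
import numpy as np

eta, lam, rho, eps = 0.01, 0.01, 0.9, 1e-8

theta = np.array([1.0, -2.0])
Eg2 = np.zeros_like(theta)

for _ in range(5):
    g = theta                        # gradient of toy f(theta) = ||theta||^2 / 2
    g_hat = g + 2 * lam * theta      # gradient of the penalized objective
    Eg2 = rho * Eg2 + (1 - rho) * g_hat**2          # accumulator sees g_hat
    theta = theta - eta * g_hat / np.sqrt(Eg2 + eps)  # RMSprop step
```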

For your question 1, I would say that you *are* changing the penalty, so it is definitely *not* standard penalty-term regularization (which would change the objective function). It actually appears more similar to the Levenberg-Marquardt algorithm for nonlinear least squares, in that the "regularization" goes to zero as $E[g^2]$ goes to zero. (However there, I believe the averaging would always be over "all the data", so not path dependent.)

In general, there aren't definitive results on one learning algorithm being "better" than another. The common wisdom (which needs to be taken with a pound of salt) has been that Adam requires less experimentation to get convergence on the first try than SGD and variants thereof. However, this is highly dataset/model dependent. More recently some groups have made the claim that despite training faster, models trained with Adam generalize worse to the validation or test sets. See https://arxiv.org/abs/1712.07628 for more information.

## Best Answer

After researching a few articles online and the Keras documentation, it seems the RMSprop optimizer is recommended for recurrent neural networks: https://github.com/keras-team/keras/blob/master/keras/optimizers.py#L209

Stochastic Gradient Descent seems to take advantage of its learning rate and momentum between each batch to optimize the model's weights based on the information from the loss function (in my case, 'categorical_crossentropy').
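The momentum mechanism mentioned above can be sketched in a few lines (classical momentum on a toy quadratic objective; the coefficients are illustrative, not Keras defaults):

```python
import numpy as np

eta, mu = 0.1, 0.9    # learning rate and momentum coefficient

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)

for _ in range(3):
    g = theta               # gradient of the toy objective ||theta||^2 / 2
    v = mu * v + eta * g    # velocity blends past gradients with the new one
    theta = theta - v       # step along the accumulated velocity
```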

I suggest http://ruder.io/optimizing-gradient-descent/index.html for additional information about optimization algorithms.