Solved – Difference between GradientDescentOptimizer and AdamOptimizer (TensorFlow)

error, gradient descent, machine learning, neural networks, supervised learning

I've written a simple MLP in TensorFlow that models an XOR gate.

So for:

input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]

it should produce the following:

output_data = [[0.], [1.], [1.], [0.]]

The network has an input layer, a hidden layer, and an output layer with 2, 5, and 1 neurons respectively.
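For reference, the 2-5-1 forward pass can be sketched in NumPy (hypothetical sigmoid activations and random weights; the original TensorFlow variables are not shown in the question):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_data = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

W_hidden = rng.normal(size=(2, 5))  # input -> hidden (2 -> 5)
b_hidden = np.zeros(5)
W_out = rng.normal(size=(5, 1))     # hidden -> output (5 -> 1)
b_out = np.zeros(1)

hidden = sigmoid(input_data @ W_hidden + b_hidden)
output = sigmoid(hidden @ W_out + b_out)  # shape (4, 1): one prediction per XOR row
```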

Currently I have the following cross entropy:

cross_entropy = -(n_output * tf.log(output) + (1 - n_output) * tf.log(1 - output))

I've also tried this simpler alternative:

cross_entropy = tf.square(n_output - output)

along with some other variants.
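The two loss formulations give quite different values for the same prediction. A small NumPy sketch with a hypothetical prediction of 0.9 against a target of 1.0, using the same element-wise formulas as above:

```python
import numpy as np

output = np.array([0.9])    # hypothetical network prediction
n_output = np.array([1.0])  # target

# Binary cross entropy, as in the first formulation: -log(0.9) ~= 0.105
cross_entropy = -(n_output * np.log(output) + (1 - n_output) * np.log(1 - output))

# Squared error, as in the simpler alternative: (1 - 0.9)^2 = 0.01
squared_error = np.square(n_output - output)
```

Note that near the wrong extreme (output close to 0 with target 1), cross entropy grows without bound while squared error saturates at 1, which is one reason the gradients of the two losses behave differently during training.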


However, no matter what my setup was, the error with GradientDescentOptimizer decreased much more slowly than with AdamOptimizer.

In fact, tf.train.AdamOptimizer(0.01) produced really good results after 400-800 learning steps (depending on the learning rate, where 0.01 gave the best results), while tf.train.GradientDescentOptimizer always needed over 2000 learning steps, no matter which cross entropy calculation or learning rate was used.

Why is this? Is AdamOptimizer simply always the better choice?

Best Answer

The tf.train.AdamOptimizer uses Kingma and Ba's Adam algorithm to control the learning rate. Adam offers several advantages over the simple tf.train.GradientDescentOptimizer. Foremost is that it uses moving averages of the gradients (momentum); Bengio discusses why this is beneficial in Section 3.1.1 of this paper. Simply put, this enables Adam to use a larger effective step size, and the algorithm will converge to this step size without fine tuning.
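For illustration, the per-parameter update from Kingma and Ba's paper can be sketched in a few lines of NumPy (default hyperparameters beta1=0.9, beta2=0.999, eps=1e-8; this is a sketch of the algorithm, not TensorFlow's internal implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and its square,
    bias-corrected, then a step scaled by the gradient's running magnitude."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum) average
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (variance) average
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 (gradient 2x) starting from x = 3
theta, m, v = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

Because m_hat / sqrt(v_hat) is roughly +/-1 while the gradient keeps its sign, each parameter moves by approximately lr per step regardless of the raw gradient's magnitude, which is what lets Adam converge without careful learning-rate tuning.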

The main downside of the algorithm is that Adam requires more computation to be performed for each parameter in each training step (to maintain the moving averages and variance, and calculate the scaled gradient), and more state to be retained for each parameter (approximately tripling the size of the model, to store the average and variance for each parameter). A simple tf.train.GradientDescentOptimizer could equally be used in your MLP, but would require more hyperparameter tuning before it would converge as quickly.
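The trade-off shows up clearly on a toy problem where the gradient is poorly scaled: with the same learning rate and step budget, plain gradient descent barely moves while Adam's normalized steps converge. A hypothetical pure-NumPy sketch (not the original MLP):

```python
import numpy as np

# f(x) = 0.01 * x^2, so the gradient 0.02 * x is tiny near the start
grad = lambda x: 0.02 * x
lr, steps = 0.01, 1000

# Plain gradient descent: the step is proportional to the raw gradient,
# so a small gradient scale means painfully slow progress at this lr
x_sgd = 5.0
for _ in range(steps):
    x_sgd -= lr * grad(x_sgd)

# Adam: the moving averages normalize the step to roughly lr per step,
# at the cost of two extra state variables (m, v) per parameter
x_adam, m, v = 5.0, 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, steps + 1):
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

Gradient descent would need its learning rate raised by orders of magnitude here to match Adam, which is exactly the kind of per-problem tuning the answer refers to.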
