Solved – Adam optimizer with exponential decay

adam, deep learning, gradient descent, neural networks, tensorflow

In most TensorFlow code I have seen, the Adam optimizer is used with a constant learning rate of 1e-4 (i.e. 0.0001). The code usually looks like the following:

...build the model...
# Add the optimizer
train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
# Add the ops to initialize variables.  These will include 
# the optimizer slots added by AdamOptimizer().
init_op = tf.initialize_all_variables()

# launch the graph in a session
sess = tf.Session()
# Actually initialize the variables
sess.run(init_op)
# now train your model
for ...:
  sess.run(train_op)

I am wondering whether it is useful to use exponential decay with the Adam optimizer, i.e. to use the following code:

...build the model...
# Add the optimizer
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.15, step, 1, 0.9999)
train_op = tf.train.AdamOptimizer(rate).minimize(cross_entropy, global_step=step)
# Add the ops to initialize variables.  These will include 
# the optimizer slots added by AdamOptimizer().
init_op = tf.initialize_all_variables()

# launch the graph in a session
sess = tf.Session()
# Actually initialize the variables
sess.run(init_op)
# now train your model
for ...:
  sess.run(train_op)

Usually people use some kind of learning rate decay, but for Adam it seems uncommon. Is there any theoretical reason for this? Can it be useful to combine the Adam optimizer with decay?

Best Answer

Empirically speaking: definitely try it out; you may find some very useful training heuristics, in which case please do share!

Usually people use some kind of decay, but for Adam it seems uncommon. Is there any theoretical reason for this? Can it be useful to combine the Adam optimizer with decay?

I haven't seen enough people's code using the ADAM optimizer to say whether this is true or not. If it is true, perhaps it's because ADAM is relatively new and learning rate decay "best practices" haven't been established yet.

I do want to note, however, that learning rate decay is actually part of the theoretical guarantee for ADAM. Specifically, in Theorem 4.1 of their ICLR article, one of their hypotheses is that the learning rate has a square root decay, $\alpha_t = \alpha/\sqrt{t}$. Furthermore, for their logistic regression experiments they use the square root decay as well.
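As a minimal sketch (assuming the same TF 1.x API as in the question, with cross_entropy already defined), that square-root decay could be wired up like this; the base rate of 1e-3 is just an illustrative value, not one taken from the paper's experiments:

step = tf.Variable(0, trainable=False)
base_rate = 1e-3  # illustrative starting rate
# Inverse square-root schedule alpha_t = alpha / sqrt(t); the +1 avoids
# division by zero on the very first step (step starts at 0).
rate = base_rate / tf.sqrt(tf.cast(step, tf.float32) + 1.0)
train_op = tf.train.AdamOptimizer(rate).minimize(cross_entropy, global_step=step)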

Simply put: I don't think anything in the theory discourages using learning rate decay rules with ADAM. I have seen people report good results using ADAM, and finding some good training heuristics would be incredibly valuable.
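If you do want to experiment with the exponential schedule from the question, one gentler variant (a hypothetical sketch; the base rate, decay interval, and decay rate below are illustrative, not tuned values) decays the rate in discrete steps rather than on every single iteration:

step = tf.Variable(0, trainable=False)
# Multiply the learning rate by 0.95 once every 1000 training steps
# instead of shrinking it on every step.
rate = tf.train.exponential_decay(1e-3, step, 1000, 0.95, staircase=True)
train_op = tf.train.AdamOptimizer(rate).minimize(cross_entropy, global_step=step)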
