Empirically speaking: definitely try it out; you may find some very useful training heuristics, in which case, please do share!
Usually people use some kind of learning rate decay; for Adam it seems uncommon. Is there any theoretical reason for this? Can it be useful to combine the Adam optimizer with decay?
I haven't seen enough people's code using the ADAM optimizer to say whether this is true or not. If it is true, perhaps it's because ADAM is relatively new and learning rate decay "best practices" haven't been established yet.
I do want to note, however, that learning rate decay is actually part of the theoretical guarantee for ADAM. Specifically, in Theorem 4.1 of their ICLR paper, one of the hypotheses is that the learning rate has a square-root decay, $\alpha_t = \alpha/\sqrt{t}$. Furthermore, for their logistic regression experiments they use the square-root decay as well.
Simply put: I don't think anything in the theory discourages using learning rate decay rules with ADAM. I have seen people report some good results using ADAM, and finding some good training heuristics would be incredibly valuable.
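If you want to try it, most frameworks let you pass a schedule instead of a constant learning rate. Here is a minimal sketch of the square-root decay $\alpha_t = \alpha/\sqrt{t}$ from Theorem 4.1, assuming TensorFlow/Keras; the class name SqrtDecay is made up for illustration.

```python
import tensorflow as tf

class SqrtDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Square-root decay alpha_t = alpha / sqrt(t), as in Theorem 4.1 of the ADAM paper."""

    def __init__(self, initial_learning_rate):
        self.initial_learning_rate = initial_learning_rate

    def __call__(self, step):
        # step is the current training iteration t; add 1 to avoid dividing by sqrt(0)
        t = tf.cast(step, tf.float32) + 1.0
        return self.initial_learning_rate / tf.sqrt(t)

# The schedule is passed to Adam in place of a constant learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=SqrtDecay(0.001))
```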
Use a gradient descent optimizer. This is a very good overview.
Regarding the code, have a look at this tutorial. This and this are some examples.
Personally, I suggest using either ADAM or RMSprop. There are still some hyperparameters to set, but there are some "standard" ones that work 99% of the time. For ADAM you can look at its paper, and for RMSprop at these slides.
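For reference, those "standard" values correspond roughly to the defaults suggested in the ADAM paper ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) and in Hinton's RMSprop slides. A minimal sketch, assuming TensorFlow/Keras; check your own framework for the exact argument names:

```python
import tensorflow as tf

# "Standard" ADAM hyperparameters from the original paper
adam = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

# Typical RMSprop settings from Hinton's slides:
# learning rate around 0.001, decay (rho) = 0.9
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
```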
EDIT
OK, you already use a gradient descent optimizer. Then you can perform some hyperparameter optimization to select the best learning rate. Recently, an automated approach has been proposed. Also, there is a lot of promising work by Frank Hutter regarding automated hyperparameter tuning.
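Even before reaching for an automated tool, a plain random search over the learning rate on a log scale is a reasonable baseline. A minimal sketch in Python/NumPy; `train_and_evaluate` is a hypothetical function that trains your model with the given learning rate and returns a validation score (higher is better):

```python
import numpy as np

def random_search_learning_rate(train_and_evaluate, n_trials=20, seed=0):
    """Try n_trials learning rates sampled on a log scale and keep the best one."""
    rng = np.random.default_rng(seed)
    best_lr, best_score = None, -np.inf
    for _ in range(n_trials):
        # Sample between 1e-5 and 1e-1 on a log scale
        lr = 10 ** rng.uniform(-5, -1)
        score = train_and_evaluate(learning_rate=lr)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```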
More generally, have a look at the AutoML Challenge, where you can also find source code from the teams. In this challenge, the goal is to automate machine learning, including hyperparameter tuning.
Finally, this paper by LeCun and this very recent tutorial by DeepMind (check Chapter 8) give some insights that might be useful for your question.
Anyway, keep in mind that (especially for easy problems) it's normal for the learning rate not to affect the learning much when using a gradient descent optimizer. Usually, these optimizers are very reliable and work with a range of hyperparameters.
Best Answer
As mentioned in the code of the function, the relation between decay_steps and decayed_learning_rate is the following:

decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)

Hence, you should set decay_steps proportional to the global_step of the algorithm.
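For concreteness, here is a minimal sketch of setting decay_steps relative to the total number of training steps, assuming TensorFlow/Keras and its ExponentialDecay schedule; the numbers are placeholders:

```python
import tensorflow as tf

# decayed_learning_rate = initial_learning_rate * decay_rate ** (global_step / decay_steps)
total_steps = 10_000  # placeholder: e.g. num_epochs * steps_per_epoch for your run

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,  # placeholder starting learning rate
    decay_steps=total_steps,    # proportional to the number of global steps
    decay_rate=0.96)            # placeholder decay factor

optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```

With decay_steps tied to the total number of steps like this, the learning rate is reduced by roughly the same overall factor regardless of how long you train.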