Solved – How to (systematically) tune the learning rate when using Gradient Descent as the Optimizer

deep learning, machine learning, python, tensorflow

I'm an outsider to the ML/DL field. I started the Udacity Deep Learning course, which is based on TensorFlow, and I'm working on assignment 3, problem 4, trying to tune the learning rate with the following configuration:

  • Batch size 128
  • Number of steps: enough to fill up 2 epochs
  • Sizes of hidden layers: 1024, 305, 75
  • Weight initialization: truncated normal with std. deviation of sqrt(2/n) where n is the size of the previous layer
  • Dropout keep probability: 0.75
  • Regularization: not applied
  • Learning Rate algorithm: exponential decay (see the sketch after this list)
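
For context, a minimal sketch of how such an exponential-decay schedule can be wired into a gradient descent optimizer with the TF1-style API used in the course; the quadratic loss is only a stand-in so the snippet runs on its own, not the actual assignment code:

    import tensorflow as tf

    # Stand-in loss so the snippet is self-contained; in the assignment this
    # would be the cross-entropy loss of the 3-hidden-layer network.
    w = tf.Variable(5.0)
    loss = tf.square(w)

    global_step = tf.Variable(0, trainable=False)

    # Exponentially decayed learning rate: starts at 0.3 and is multiplied by
    # 0.96 every 3000 steps (discrete jumps with staircase=True, smooth otherwise).
    learning_rate = tf.train.exponential_decay(
        learning_rate=0.3,
        global_step=global_step,
        decay_steps=3000,
        decay_rate=0.96,
        staircase=True)

    # Passing global_step makes minimize() increment it, which drives the decay.
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
        loss, global_step=global_step)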

I played around with the learning-rate parameters, but they don't seem to have much effect in most cases. My code is here; the results are:

Accuracy (%)   learning_rate   decay_steps   decay_rate   staircase
93.7           0.1             3000          0.96         True
94.0           0.3             3000          0.86         False
94.0           0.3             3000          0.96         False
94.0           0.3             3000          0.96         True
94.0           0.5             3000          0.96         True
  • How should I systematically tune the learning rate?
  • How is the learning rate related to the number of steps?

Best Answer

Use a gradient descent optimizer. This is a very good overview.

Regarding the code, have a look at this tutorial. This and this are some examples.

Personally, I suggest using either Adam or RMSprop. There are still some hyperparameters to set, but there are "standard" values that work 99% of the time. For Adam you can look at its paper, and for RMSprop at these slides.
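For instance, a minimal sketch (TF1-style API, which the Udacity course uses) with the commonly cited default values; in practice the learning rate is usually the only one worth touching:

    import tensorflow as tf

    # Stand-in loss so the snippet runs on its own; replace with your network's loss.
    w = tf.Variable(5.0)
    loss = tf.square(w)

    # Adam with the defaults from its paper (beta1=0.9, beta2=0.999, epsilon=1e-8);
    # a learning rate of 1e-3 is a common starting point.
    train_adam = tf.train.AdamOptimizer(learning_rate=1e-3,
                                        beta1=0.9,
                                        beta2=0.999,
                                        epsilon=1e-8).minimize(loss)

    # RMSprop with its usual defaults (decay=0.9); again, mainly tune the learning rate.
    train_rmsprop = tf.train.RMSPropOptimizer(learning_rate=1e-3,
                                              decay=0.9).minimize(loss)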

EDIT

OK, you already use a gradient descent optimizer. In that case, you can perform some hyperparameter optimization to select the best learning rate. Recently, an automated approach was proposed. There is also a lot of promising work by Frank Hutter on automated hyperparameter tuning.
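Before reaching for the automated tools, a plain random search over the learning rate on a log scale often does the job. A hedged sketch: train_and_evaluate is a hypothetical helper standing in for your actual training run (here it returns a synthetic accuracy curve just so the snippet is runnable):

    import numpy as np

    def train_and_evaluate(learning_rate):
        # Stand-in for the real run: in practice, train the network for the usual
        # number of epochs with this learning rate and return validation accuracy.
        # A synthetic curve peaked near lr = 0.1 keeps the sketch self-contained.
        return 0.94 - 0.05 * (np.log10(learning_rate) + 1.0) ** 2

    # Sample candidate learning rates uniformly on a log scale between 1e-4 and 1.
    rng = np.random.RandomState(0)
    candidates = 10.0 ** rng.uniform(-4, 0, size=20)

    # Evaluate each candidate and keep the one with the best validation accuracy.
    results = {lr: train_and_evaluate(lr) for lr in candidates}
    best_lr = max(results, key=results.get)
    print("best learning rate: %.4g (validation accuracy %.3f)" % (best_lr, results[best_lr]))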

More generally, have a look at the AutoML Challenge, where you can also find source code from the participating teams. In this challenge the goal is to automate machine learning, including hyperparameter tuning.

Finally, this paper by LeCun and this very recent tutorial by DeepMind (check Chapter 8) give some insights that might be useful for your question.

Anyway, keep in mind that, especially for easy problems, it's normal for the learning rate not to affect the learning much when using a gradient descent optimizer. These optimizers are usually very robust and work well across a range of hyperparameter values.