Solved – why the Adam optimizer is considered robust to the value of its hyperparameters

adam, deep-learning, hyperparameter, neural-networks, optimization

I was reading about the Adam optimizer for deep learning and came across the following sentence in the book Deep Learning by Goodfellow, Bengio, and Courville:

Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.

If this is true, it's a big deal, because hyperparameter search can be really important (in my experience, at least) to the statistical performance of a deep learning system. Thus, my question is: why is Adam robust to such important parameters? Especially $\beta_1$ and $\beta_2$?
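
To fix notation for the question, here is a minimal sketch of the Adam update (following Algorithm 1 of the paper and its suggested defaults), just to show where $\beta_1$, $\beta_2$, and the learning rate $\alpha$ enter. The NumPy function below is my own illustration, not code from the paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)              # bias correction for zero-initialized v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```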

I've read the Adam paper, and it doesn't provide any explanation of why it works well with those parameters or why it's robust. Do they justify that elsewhere?

Also, as I read the paper, the number of hyperparameter values they tried was very small: only two for $\beta_1$ and only three for $\beta_2$. How can this be a thorough empirical study if it covers only a 2×3 grid of hyperparameter settings?

Best Answer

Regarding evidence for the claim, I believe the only support can be found in Figure 4 of their paper, which shows final results under a range of different values for $\beta_1$, $\beta_2$, and $\alpha$.

Personally, I don't find their argument convincing, in particular because they do not present results across a variety of problems. That said, I have used Adam on a variety of problems myself, and my personal finding is that the default values of $\beta_1$ and $\beta_2$ do seem surprisingly reliable, although a good deal of fiddling with $\alpha$ is required.
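
To make that workflow concrete, here is a hypothetical toy sweep that keeps the paper's default $\beta_1 = 0.9$, $\beta_2 = 0.999$ fixed and varies only the learning rate $\alpha$. The regression problem, the learning-rate grid, and the use of PyTorch's built-in torch.optim.Adam are my own choices for illustration:

```python
import torch

# Toy regression problem: recover w_true from noisy linear data.
torch.manual_seed(0)
X = torch.randn(256, 10)
w_true = torch.randn(10, 1)
y = X @ w_true + 0.1 * torch.randn(256, 1)

# Keep the default betas fixed and sweep only the learning rate alpha.
for lr in (1e-4, 1e-3, 1e-2, 1e-1):
    w = torch.zeros(10, 1, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr, betas=(0.9, 0.999))
    for _ in range(500):
        opt.zero_grad()
        loss = torch.mean((X @ w - y) ** 2)
        loss.backward()
        opt.step()
    print(f"lr={lr:g}  final MSE={loss.item():.4f}")
```

The point of the sketch is that only $\alpha$ varies across runs; the betas stay at their paper defaults throughout, which is the one-dimensional tuning described above.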