Solved – Why is it important to include a bias correction term for the Adam optimizer for Deep Learning

adam, conv-neural-network, machine-learning, neural-networks, optimization

I was reading about the Adam optimizer for Deep Learning and came across the following sentence in the new book Deep Learning by Bengio, Goodfellow and Courville:

Adam includes bias corrections to the estimates of both the
first-order moments (the momentum term) and the (uncentered)
second-order moments to account for their initialization at the
origin.

It seems that the main reason to include these bias correction terms is that they remove the bias introduced by initializing $m_0 = 0$ and $v_0 = 0$.

  • I am not 100% sure what that means, but it seems that the 1st and 2nd moments start at zero, and somehow starting them off at zero skews the values towards zero in a way that is unfair (or perhaps useful?) for training.
  • Though I would love to know what that means a bit more precisely and how it damages learning. In particular, what advantage does un-biasing the optimizer have in terms of optimization?
  • How does this help training deep learning models?
  • Also, what does it mean when it's unbiased? I am familiar with what an unbiased standard deviation means, but it's not clear to me what it means in this context.
  • Is bias correction really a big deal or is that something overhyped in the Adam optimizer paper?

Just so people know: I've tried really hard to understand the original paper, but I've gotten very little out of reading and re-reading it. I assume some of these questions might be answered there, but I can't seem to parse the answers.

Best Answer

The problem of NOT correcting the bias
According to the paper

In case of sparse gradients, for a reliable estimate of the second moment one needs to average over many gradients by choosing a small value of β2; however it is exactly this case of small β2 where a lack of initialisation bias correction would lead to initial steps that are much larger.


Normally in practice $\beta_2$ is set much closer to 1 than $\beta_1$ (as suggested by the authors, $\beta_2=0.999$ and $\beta_1=0.9$), so the update coefficient $1-\beta_2=0.001$ is much smaller than $1-\beta_1=0.1$.

In the first step of training, $m_1=0.1\,g_1$ and $v_1=0.001\,g_1^2$, so the $m_1/(\sqrt{v_1}+\epsilon)$ term in the parameter update can be very large (about $3.16$, regardless of the gradient's magnitude) if we use the biased estimates directly.

On the other hand, when using the bias-corrected estimates, $\hat{m}_1=g_1$ and $\hat{v}_1=g_1^2$, the $\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$ term becomes less sensitive to $\beta_1$ and $\beta_2$.
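To make this concrete, here is a minimal numeric sketch (using the default $\beta$ values above and an arbitrary first gradient) comparing the two first steps:

```python
import numpy as np

# Toy sketch (not a full Adam implementation): size of the first update
# direction with and without bias correction, using the paper's defaults.
beta1, beta2, eps = 0.9, 0.999, 1e-8
g1 = 0.01  # an arbitrary first gradient

# Biased first-step estimates: m_1 = (1 - beta1) g_1, v_1 = (1 - beta2) g_1^2
m1 = (1 - beta1) * g1
v1 = (1 - beta2) * g1 ** 2
print(m1 / (np.sqrt(v1) + eps))          # ~3.16, regardless of |g_1|

# Bias-corrected estimates: m_hat_1 = g_1, v_hat_1 = g_1^2
m1_hat = m1 / (1 - beta1)
v1_hat = v1 / (1 - beta2)
print(m1_hat / (np.sqrt(v1_hat) + eps))  # ~1.0, a sensible first step
```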

How the bias is corrected
The algorithm uses a moving average to estimate the first and second moments. The biased estimate works as follows: we start from an arbitrary guess $m_0$ and gradually update the estimate by $m_t=\beta m_{t-1}+(1-\beta)g_t$. So it's obvious that in the first few steps our moving average is heavily biased towards the initial $m_0$.

To correct this, we can remove the effect of the initial guess (the bias) from the moving average. For example, at time 1, $m_1=\beta m_0+(1-\beta)g_1$; we take the $\beta m_0$ term out of $m_1$ and divide by $(1-\beta)$, which yields $\hat{m}_1=(m_1-\beta m_0)/(1-\beta)$. When $m_0=0$, this becomes $\hat{m}_t=m_t/(1-\beta^t)$. The full proof is given in Section 3 of the paper.
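Here is a minimal sketch of this correction, assuming $m_0=0$ and a made-up gradient sequence; at $t=1$ the biased estimate sits near zero while the corrected one already equals $g_1$:

```python
import numpy as np

# Sketch of the bias-corrected moving average, assuming m_0 = 0
# and a made-up gradient sequence hovering around 1.
beta = 0.9
grads = np.array([1.0, 1.2, 0.8, 1.1, 0.9])

m = 0.0
for t, g in enumerate(grads, start=1):
    m = beta * m + (1 - beta) * g   # biased estimate, pulled toward m_0 = 0
    m_hat = m / (1 - beta ** t)     # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))
# At t = 1: m = 0.1 (far below the gradients ~1), while m_hat = 1.0.
```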


As Mark L. Stone has aptly commented:

It's like multiplying by 2 (oh my, the result is biased), and then dividing by 2 to "correct" it.

However, this is not exactly equivalent to

the gradient at initial point is used for the initial values of these things, and then the first parameter update

(Of course, it can be turned into the same form by changing the update rule; see the update at the end of this answer. I believe the comment mainly aims to show that introducing the bias is unnecessary, but the difference is perhaps worth noticing.)

For example, the corrected first moment at time 2 is

$$\hat{m_2}=\frac{\beta(1-\beta)g_1+(1-\beta)g_2}{1-\beta^2}=\frac{\beta g_1+g_2}{\beta+1}$$

If we instead use $g_1$ as the initial value with the same update rule, $$m_2=\beta g_1+(1-\beta)g_2,$$ which is biased towards $g_1$ in the first few steps.
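A quick numeric check, with made-up values $\beta=0.9$, $g_1=1$, $g_2=2$, makes the difference concrete:

```python
beta, g1, g2 = 0.9, 1.0, 2.0

# Bias-corrected estimate at t = 2: (beta*g1 + g2) / (beta + 1)
m2_hat = (beta * g1 + g2) / (beta + 1)  # ~1.53, between g1 and g2
# Same recursion but initialized at g1: beta*g1 + (1 - beta)*g2
m2_init = beta * g1 + (1 - beta) * g2   # 1.1, pulled strongly toward g1
print(m2_hat, m2_init)
```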

Is bias correction really a big deal
Since it only actually affects the first few steps of training, it does not seem to be a very big issue; in many popular frameworks (e.g. Keras, Caffe) only the biased estimate is implemented.

From my experience, the biased estimate sometimes leads to undesirable situations where the loss won't go down (I haven't tested this thoroughly, so I'm not exactly sure whether it is due to the biased estimate or something else), and a trick I use is to set a larger $\epsilon$ to moderate the initial step sizes.
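For intuition, here is a toy sketch (all numbers made up for illustration) of how a larger $\epsilon$ damps the oversized first step of the uncorrected estimates:

```python
import numpy as np

# Toy illustration: with the uncorrected estimates, a larger epsilon
# moderates the oversized first step (gradient value is made up).
beta1, beta2 = 0.9, 0.999
g1 = 1e-3
m1 = (1 - beta1) * g1
v1 = (1 - beta2) * g1 ** 2
for eps in (1e-8, 1e-4, 1e-3):
    print(eps, m1 / (np.sqrt(v1) + eps))
# eps = 1e-8 -> ~3.16; eps = 1e-4 -> ~0.76; eps = 1e-3 -> ~0.097
```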

Update
If you unfold the recursive update rules, $\hat{m}_t$ is essentially a weighted average of the gradients,
$$\hat{m}_t=\frac{\beta^{t-1}g_1+\beta^{t-2}g_2+\dots+g_t}{\beta^{t-1}+\beta^{t-2}+\dots+1}$$
The denominator can be computed with the geometric sum formula, so this is equivalent to the following update rule (which doesn't involve a bias term):

$m_1\leftarrow g_1$
while not converged do
$\qquad m_t\leftarrow \beta m_{t-1} + g_t$ (weighted sum)
$\qquad \hat{m}_t\leftarrow \dfrac{(1-\beta)m_t}{1-\beta^t}$ (weighted average)

Therefore it can be done without introducing a bias term and then correcting it. I think the paper put it in the bias-correction form for ease of comparison with other algorithms (e.g. RMSProp).
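As a sanity check, here is a small sketch (with a made-up random gradient sequence) verifying that the reformulated rule above agrees with the standard zero-initialized, bias-corrected moving average at every step:

```python
import numpy as np

# Sketch: verify the reformulated rule (weighted sum, no bias term) agrees
# with the standard zero-initialized EMA plus bias correction.
rng = np.random.default_rng(0)
beta = 0.9
grads = rng.normal(size=10)  # made-up gradient sequence

m_ema = 0.0   # standard form: EMA started at zero, corrected afterwards
m_sum = 0.0   # reformulated form: plain weighted sum of gradients
for t, g in enumerate(grads, start=1):
    m_ema = beta * m_ema + (1 - beta) * g
    m_hat_ema = m_ema / (1 - beta ** t)

    m_sum = beta * m_sum + g
    m_hat_sum = (1 - beta) * m_sum / (1 - beta ** t)

    assert np.isclose(m_hat_ema, m_hat_sum)
print("both forms agree at every step")
```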