Solved – the correct AdaGrad update equation to use in practice as an automatic way to choose the step size

gradient-descent · machine-learning · neural-networks · optimization

I was reading the AdaGrad paper by Duchi et al., and also a shorter paper on Adaptive Online Gradient Descent, because I wanted an automatic way to choose the step size for the update rule commonly used in deep learning with stochastic gradient descent, such as:

$$ \theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla L(y, f_\theta (x) ) $$

with respect to some loss function $L$.

In that paper I came across a lot of equations, proofs, and dense mathematics, but I had a hard time extracting a single update equation that I could implement in a commonly used environment like MATLAB, Octave, or Python.

In my quest to find such an equation, I found something promising in an optimization lecture from Cambridge University, which gives:

$$ w^{(t+1)}_i = w^{(t)}_i - \frac{\eta}{\sqrt{\sum^t_{\tau = 1} g^2_{\tau,i}} } g_{t , i}$$

where it seems that the step size they use is:

$$ \eta_t = \frac{\eta}{\sqrt{\sum^t_{\tau = 1} g^2_{\tau,i}} } $$

However, the lecture doesn't explain in much detail what that equation means or what each term stands for. From reading the AdaGrad paper, it seems to me that $g_{\tau , i}$ is simply the $i$-th entry of the (sub)gradient at iteration $\tau$ (i.e. $\frac{\partial f_\tau}{\partial \theta_i}$ at iteration $\tau$).
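If my reading is right, the update would look something like this in Python (just a sketch of my understanding; the `eps` guard against dividing by zero is my own addition and is not in the slide):

```python
import numpy as np

def adagrad_step(w, g, hist_sq, eta=0.01, eps=1e-8):
    """One AdaGrad update as I currently understand it.

    w       -- current parameter vector
    g       -- this iteration's (sub)gradient g_t
    hist_sq -- running sum of squared gradients, one entry per parameter
    eta     -- the constant eta in the formula
    eps     -- small guard against division by zero (my own addition,
               not in the lecture slide)
    """
    hist_sq = hist_sq + g ** 2                   # accumulate g_{tau,i}^2 up to t
    w = w - eta * g / (np.sqrt(hist_sq) + eps)   # per-coordinate step
    return w, hist_sq
```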

Since neither the lecture nor the paper seems to explain these well, here are my questions:

  1. Where is that equation in the paper, and how does one get it? Is it the correct update to use? I just want to make sure it is correct and ideally be able to derive it myself next time.
  2. What is $\eta$ in $\eta_t = \frac{\eta}{\sqrt{\sum^t_{\tau = 1} g^2_{\tau,i}} } $? Is it another constant I need to choose? How does one choose it? (If I could find where this equation is in the paper, I would just reference it… but where is it?)
  3. For what applications/scenarios is that the correct equation to use? If I have a different network to train, how do I change that update rule to suit it? In other words, how do I adapt it to different scenarios where SGD might be applied?

If someone finds out how to obtain that equation, I'd love to understand how to derive it from the paper or where it came from! 😀

Best Answer

  1. The update rule

$$w_i^{(t+1)} = w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}g_{t,i},$$

is the composite mirror descent variant of AdaGrad. Take a look at equation 4 (or 23) in the paper. When there is no regularization term, the update can be derived (take the derivative and set it to zero) as:

$$ w_{t+1} = w_t - \eta \,\text{diag}(G_t)^{-1/2} g_t, \qquad G_t = \sum_{\tau=1}^t g_\tau g_\tau^\top.$$

Notice that $\text{diag}(G_t)^{-1/2}_{ii} = \frac{1}{\sqrt{\sum_{\tau=1}^t g_{\tau,i}^2}}$ and that multiplying a diagonal matrix by a vector amounts to element-wise multiplication, which recovers the per-coordinate update above.
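Here is a minimal Python sketch of that diagonal update, assuming a generic `grad_fn` that returns the (sub)gradient $g_t$ at the current iterate; the small `eps` is a common practical guard against dividing by zero at the first step and is not part of the equation above:

```python
import numpy as np

def adagrad(w0, grad_fn, eta=0.1, n_steps=100, eps=1e-8):
    """Diagonal AdaGrad: w_{t+1} = w_t - eta * diag(G_t)^{-1/2} g_t."""
    w = np.asarray(w0, dtype=float).copy()
    G_diag = np.zeros_like(w)                   # diagonal of G_t, i.e. sum of g_{tau,i}^2
    for _ in range(n_steps):
        g = grad_fn(w)                          # (stochastic) gradient at w
        G_diag += g ** 2                        # accumulate squared gradient entries
        w -= eta * g / (np.sqrt(G_diag) + eps)  # element-wise = diag(G_t)^{-1/2} g_t
    return w
```

Note that the accumulator only ever grows, so the effective step size for each coordinate shrinks monotonically; that is the sense in which AdaGrad chooses the step size automatically.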

  2. $\eta$ is a constant base step size that determines how large each update is. I would try several choices and compare them.

  3. Changing an algorithm from SGD to AdaGrad just requires plugging in your gradient values: you keep a running sum of the squared gradient elements for the term in the denominator, and otherwise your training loop is unchanged (see the short example below).
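For instance, a self-contained toy example (least-squares regression on synthetic data; the choices of `eta` and the `eps` guard are illustrative, not prescribed by the paper):

```python
import numpy as np

# Toy example: fit least-squares weights with stochastic gradients + AdaGrad.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(5)
sq_grad_sum = np.zeros_like(w)                   # running sum of squared gradient elements
eta, eps = 0.5, 1e-8

for t in range(2000):
    i = rng.integers(len(y))                     # one random example -> stochastic gradient
    g = (X[i] @ w - y[i]) * X[i]                 # gradient of 0.5 * (x_i^T w - y_i)^2
    sq_grad_sum += g ** 2                        # the extra AdaGrad bookkeeping
    w -= eta * g / (np.sqrt(sq_grad_sum) + eps)  # per-coordinate adaptive step

print(np.round(w, 2))                            # should be close to true_w
```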
