[Math] Projected gradient descent with momentum

convex optimization, gradient descent, numerical methods, optimization

Can we apply momentum to projected gradient descent? If so, how should we do that?

In the domain I'm working on, momentum greatly speeds up gradient descent. However, I want to do projected gradient descent rather than plain-old gradient descent. In projected gradient descent, after each update step we project $x$ onto a convex set $\mathcal{C}$. If I apply momentum naively to projected gradient descent, things seem to work poorly. I'm guessing this is because the projection changes the input to the next iteration, and the algorithm doesn't expect this.
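To make the naive combination concrete, here is a minimal sketch of what I'm currently doing, assuming (as a hypothetical example) that $\mathcal{C}$ is the unit Euclidean ball; the velocity update is completely unaware of the projection:

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto C = {x : ||x|| <= 1} (a stand-in for a general convex set)."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def pgd_naive_momentum(grad, x0, lr=0.1, beta=0.9, steps=100):
    """Projected gradient descent with naive heavy-ball momentum:
    the velocity accumulates raw gradients and ignores the projection."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        v = beta * v + grad(x)          # standard momentum update
        x = project_unit_ball(x - lr * v)  # project after the momentum step
    return x

# Toy problem: minimize ||x - c||^2 with c outside the ball;
# the constrained optimum is c / ||c||.
c = np.array([2.0, 0.0])
x_star = pgd_naive_momentum(lambda x: 2 * (x - c), np.zeros(2))
```

On this toy problem the iterates end up on the boundary at $c/\lVert c\rVert$, but note that the velocity $v$ keeps growing against the constraint even once $x$ is pinned to the boundary, which illustrates why this naive combination can behave poorly on harder problems.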

Is there a suitable update rule for projected gradient descent with momentum? How should we modify the standard rule for momentum to incorporate the projection operation?

Best Answer

It appears that there are methods for accelerated projected/proximal gradient descent, though no one seems to have worked out how to combine the best-performing acceleration methods for ordinary gradient descent (e.g., Adam, RMSprop) with projected/proximal gradient descent yet. So you can choose either projected/proximal gradient descent with a sub-par method of acceleration, or ordinary gradient descent with a method of acceleration that works better in practice.
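One accelerated scheme that has been worked out for the projected/proximal setting is Nesterov-style acceleration, in the form of FISTA: the gradient step is taken from an extrapolated point $y$ (which may lie outside $\mathcal{C}$), and only the main iterate $x$ is projected back onto the set. A minimal sketch, again using the unit ball as a hypothetical choice of $\mathcal{C}$:

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto C = {x : ||x|| <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def fista_projected(grad, x0, lr, steps=200):
    """FISTA-style accelerated projected gradient descent.
    lr should be at most 1/L, where L is the Lipschitz constant of grad."""
    x_prev = x0.copy()
    y, t = x0.copy(), 1.0
    for _ in range(steps):
        # Projected gradient step from the extrapolated point y.
        x = project_unit_ball(y - lr * grad(y))
        # Nesterov momentum-weight update.
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Extrapolate; y itself is allowed to leave the constraint set.
        y = x + ((t - 1.0) / t_next) * (x - x_prev)
        x_prev, t = x, t_next
    return x_prev

# Same toy problem: minimize ||x - c||^2 over the unit ball, c outside it.
# For f(x) = ||x - c||^2, the gradient is 2(x - c) and L = 2, so lr <= 0.5.
c = np.array([2.0, 0.0])
x_star = fista_projected(lambda x: 2 * (x - c), np.zeros(2), lr=0.4)
```

The key structural difference from the naive version is that the momentum (extrapolation) is applied to the already-projected iterates, so the acceleration is consistent with the constraint rather than fighting it.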

In more detail: there are multiple methods of acceleration, e.g., simple momentum, Nesterov acceleration, Adagrad, Adadelta, RMSprop, and Adam. In practice, experience with ordinary gradient descent suggests that some of these perform better than others. For instance, for some machine learning tasks, Adam currently appears to be the most effective of those.

When you move from ordinary gradient descent to projected/proximal gradient descent, there has been some work on combining some of those methods of acceleration with proximal gradient descent, but the literature lags a bit. For instance, it appears that no one has worked out how to apply Adam-style momentum to projected or proximal gradient descent. Perhaps in time the literature will catch up.