As this post shows nicely, there is a difference between rho and decay in RMSprop. I can't clearly see what TensorFlow's RMSprop parameter decay stands for. Is this the learning rate decay? And if so, where is the rho?
Solved – RMSprop difference rho and decay in Tensorflow
deep learning, tensorflow
Related Solutions
Empirically speaking: definitely try it out, you may find some very useful training heuristics, in which case, please do share!
Usually people use some kind of learning rate decay; for Adam it seems uncommon. Is there any theoretical reason for this? Can it be useful to combine the Adam optimizer with decay?
I haven't seen enough people's code using ADAM optimizer to say if this is true or not. If it is true, perhaps it's because ADAM is relatively new and learning rate decay "best practices" haven't been established yet.
I do want to note however that learning rate decay is actually part of the theoretical guarantee for ADAM. Specifically in Theorem 4.1 of their ICLR article, one of their hypotheses is that the learning rate has a square root decay, $\alpha_t = \alpha/\sqrt{t}$. Furthermore, for their logistic regression experiments they use the square root decay as well.
Simply put: I don't think anything in the theory discourages using learning rate decay rules with ADAM. I have seen people report some good results using ADAM and finding some good training heuristics would be incredibly valuable.
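For concreteness, the square-root decay rule from Theorem 4.1 is easy to write down. This is just a sketch of the schedule itself, independent of any framework:

```python
# Sketch of the square-root learning-rate decay used in ADAM's
# convergence analysis: alpha_t = alpha / sqrt(t), for step t >= 1.
def sqrt_decay(alpha, t):
    """Learning rate at step t (t >= 1)."""
    return alpha / (t ** 0.5)

# The rate shrinks slowly: halved after 4 steps, tenthed after 100.
rates = [sqrt_decay(0.001, t) for t in (1, 4, 100)]
```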
For straight SGD, you have $$\theta_{t+1} = \theta_t - \eta g_t$$ and $\eta$ is the "learning rate" (a.k.a. step size).
In principle any gradient descent, including SGD, is trying to find a stationary point of the objective function $f[\theta]$, where $g=\partial_{\theta}f$ (and hopefully a local minimum, rather than a maximum or saddle). It is easy to see that $$g_t=0 \implies \theta_{t+1} = \theta_t$$ so the SGD update is consistent with this (as is the RMSprop update you cite).
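A quick numeric sketch of this, using a made-up quadratic objective chosen only for illustration:

```python
# Sketch: plain SGD update theta_{t+1} = theta_t - eta * g_t.
# Once the gradient is zero, the parameters stop moving.
def sgd_step(theta, grad, eta=0.1):
    return theta - eta * grad

# Toy objective f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
theta = 0.0
for _ in range(200):
    theta = sgd_step(theta, 2 * (theta - 3))
# theta converges to 3, the stationary point where g = 0
```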
When you give the regularized update equation $$\theta_{t+1} = \theta_t - \eta (g_t + 2\lambda\theta_t)$$ notice how $$\theta_{t+1} = \theta_t \implies g_t + 2\lambda\theta_t = 0$$ So the stationary point no longer corresponds to $g=0$, but rather corresponds to $$\hat{g} = g + 2\lambda\theta = \partial_{\theta}(f+\lambda\theta^2) = 0$$ i.e. the gradient of the regularized objective function, which includes a penalty term. (This also applies to the first "regularized RMSprop update" formula you give.)
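Continuing the same kind of toy quadratic, here is a sketch of the regularized update showing the shifted stationary point (the objective and constants are illustrative, not from the original question):

```python
# Sketch: the regularized update is just plain SGD applied to the
# penalized objective f(theta) + lambda * theta^2, whose gradient is
# g + 2 * lambda * theta.
def regularized_sgd_step(theta, grad, eta=0.1, lam=0.01):
    return theta - eta * (grad + 2 * lam * theta)

# Toy objective f(theta) = (theta - 3)^2: the stationary point shifts
# from theta = 3 to theta = 3 / (1 + lam), where g + 2*lam*theta = 0.
theta = 0.0
for _ in range(500):
    theta = regularized_sgd_step(theta, 2 * (theta - 3))
```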
Your "adjusted RMSprop" update equation $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t+2\eta\lambda\theta_t$$ corresponds to a stationary point $$\theta_{t+1} = \theta_t \implies g_t - 2\left(\lambda\sqrt{E[g^2]_t+\epsilon}\right)\theta_t = 0$$ This shows that the update does not correspond to any consistent objective function. Rather, it corresponds to an "evolving objective function" where the effective regularization weight $\hat{\lambda}$ changes through time, and depends on the path the optimization takes, i.e. $E[g^2]$. (Note: It appears you have a sign change in the last formula ... did you mean to have a $-\lambda$ perhaps?)
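A small sketch of what "path dependent" means here: the effective weight $\hat{\lambda} = \lambda\sqrt{E[g^2]_t+\epsilon}$ evaluated at illustrative (made-up) values of the running average:

```python
# Sketch: in the "adjusted" update, the effective regularization weight
# lambda_hat = lam * sqrt(E[g^2] + eps) depends on the running average
# of squared gradients, so it changes along the optimization path.
def effective_lambda(lam, ms, eps=1e-8):
    return lam * (ms + eps) ** 0.5

# The same lam yields different effective penalties at different points:
early = effective_lambda(0.01, ms=4.0)    # large gradients: ~0.02
late = effective_lambda(0.01, ms=0.01)    # small gradients: ~0.001
```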
Most "momentum" techniques will try to preserve the stationary points of the objective function (which may include penalty terms). For your question 2, I would say the standard approach is simply to add the penalty term to the objective function, so that it shows up in the gradient $g$ automatically (and then RMSprop, or whatever method, will incorporate it into $E[g^2]$).
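A sketch of that standard approach, with a hand-rolled RMSprop on a toy quadratic (the loss and all the constants here are illustrative assumptions, not from the question):

```python
# Sketch: add the penalty to the objective so it enters the gradient g
# automatically -- and hence also enters the running average E[g^2].
def rmsprop_with_penalty(theta, steps=1000, eta=0.01, rho=0.9,
                         eps=1e-8, lam=0.01):
    ms = 0.0
    for _ in range(steps):
        g = 2 * (theta - 3) + 2 * lam * theta   # gradient of penalized loss
        ms = rho * ms + (1 - rho) * g * g       # running average E[g^2]
        theta -= eta * g / (ms + eps) ** 0.5
    return theta

# Converges near the penalized stationary point theta = 3 / (1 + lam).
theta = rmsprop_with_penalty(0.0)
```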
For your question 1, I would say that you are changing the penalty, so it is definitely not standard penalty-term regularization (which would change the objective function). It actually appears more similar to the Levenberg-Marquardt algorithm for nonlinear least squares, in that the "regularization" goes to zero as $E[g^2]$ goes to zero. (However there, I believe the averaging would always be over "all the data", so not path dependent.)
Best Answer
In the post that you added they talk about the RMSprop implementation from Keras (it's included in TensorFlow), but the other link that you added is the original implementation of RMSProp from TensorFlow. TensorFlow's implementation has a parameter called decay, but it's the same as rho in the Keras library. In the RMSProp implementation from Keras, however, the decay parameter reduces the learning rate over time.
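To make the distinction concrete, here is an illustrative, hand-rolled single step of RMSprop. This is not the actual Keras source, just a sketch assuming the lr-decay formula `lr / (1 + decay * iterations)` that the old Keras optimizers used:

```python
# Illustrative sketch: rho smooths the squared gradients, while decay
# (in the Keras sense) shrinks the learning rate itself over iterations.
def rmsprop_step(theta, grad, ms, t, lr=0.001, rho=0.9,
                 decay=0.0, eps=1e-8):
    lr_t = lr / (1.0 + decay * t)            # Keras-style lr decay
    ms = rho * ms + (1 - rho) * grad * grad  # rho: moving average of g^2
    theta = theta - lr_t * grad / (ms + eps) ** 0.5
    return theta, ms
```

With `decay=0.0` the step size stays constant and only `rho` has an effect; with a nonzero `decay`, the same gradient moves the parameters less and less as `t` grows.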