Solved – Weight decay and RMSprop in neural networks

deep-learning, gradient-descent, machine-learning, neural-networks, regularization

I've been implementing RMSprop following this helpful blog post. The post doesn't talk about weight decay, i.e. regularization. What I'm implementing is effectively a ridge penalty.

The RMSprop update is defined as
$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t
$$
where $\eta$ is the learning rate, $E[g^2]$ is the RMSprop running average of the past squared gradients, $\epsilon$ is the don't-divide-by-zero fudge factor, and $g_t$ is the gradient.
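As a concrete reference, the plain RMSprop update can be sketched in NumPy like this (a minimal, hypothetical sketch; the decay factor `rho` for the running average is an assumed value, since the formula above doesn't specify it):

```python
import numpy as np

def rmsprop_step(theta, g, Eg2, eta=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update. `rho` is an assumed decay factor for the
    running average E[g^2]; it is not specified in the formula above."""
    Eg2 = rho * Eg2 + (1.0 - rho) * g**2           # running average of squared gradients
    theta = theta - eta / np.sqrt(Eg2 + eps) * g   # adaptively scaled gradient step
    return theta, Eg2
```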

Now, for normal SGD with weight decay, I would have
$$
\theta_{t+1} = \theta_t - \eta (g_t + 2\lambda\theta_t)
$$
For RMSprop, I first did

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}(g_t+2\lambda\theta_t)
$$
That didn't work very well. MSE at convergence was essentially insensitive to the penalty factor. Without a whole lot of theoretical justification, I tried
$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t+2\eta\lambda\theta_t
$$
…which worked a lot better.
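In code, the two variants correspond roughly to the following NumPy sketch (hypothetical names; `lam` is the ridge weight $\lambda$ and `rho` an assumed decay factor for the running average):

```python
import numpy as np

def rmsprop_reg_v1(theta, g, Eg2, eta=0.001, rho=0.9, eps=1e-8, lam=0.01):
    """First attempt: ridge gradient added inside the adaptive scaling."""
    Eg2 = rho * Eg2 + (1.0 - rho) * g**2
    theta = theta - eta / np.sqrt(Eg2 + eps) * (g + 2.0 * lam * theta)
    return theta, Eg2

def rmsprop_reg_v2(theta, g, Eg2, eta=0.001, rho=0.9, eps=1e-8, lam=0.01):
    """Second attempt: decay term applied outside the adaptive scaling,
    with the + sign exactly as written in the formula above."""
    Eg2 = rho * Eg2 + (1.0 - rho) * g**2
    theta = theta - eta / np.sqrt(Eg2 + eps) * g + 2.0 * eta * lam * theta
    return theta, Eg2
```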

My questions:

  1. Why did this work better? I guess it is because you don't want to adaptively penalize, but you do want to adaptively change the learning rate. Adaptively penalizing would basically shrink the ridge penalty with the step size.
  2. Is there a better way to regularize in the context of RMSprop?

Best Answer

For straight SGD, you have $$\theta_{t+1} = \theta_t - \eta g_t$$ and $\eta$ is the "learning rate" (a.k.a. step size).

In principle, any gradient descent method, including SGD, is trying to find a stationary point of the objective function $f[\theta]$, where $g=\partial_{\theta}f$ (and hopefully a local minimum, rather than a maximum or saddle). It is easy to see that $$g_t=0 \implies \theta_{t+1} = \theta_t$$ so the SGD update is consistent with this (as is the RMSprop update you cite).

When you give the regularized update equation $$\theta_{t+1} = \theta_t - \eta (g_t + 2\lambda\theta_t)$$ notice how $$\theta_{t+1} = \theta_t \implies g_t + 2\lambda\theta_t = 0$$ So the stationary point no longer corresponds to $g=0$, but rather corresponds to $$\hat{g} = g + 2\lambda\theta = \partial_{\theta}(f+\lambda\theta^2) = 0$$ i.e. the gradient of the regularized objective function, which includes a penalty term. (This also applies to the first "regularized RMSprop update" formula you give.)

Your "adjusted RMSprop" update equation $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t+2\eta\lambda\theta_t$$ corresponds to a stationary point $$\theta_{t+1} = \theta_t \implies g_t - 2\left(\lambda\sqrt{E[g^2]_t+\epsilon}\right)\theta_t = 0$$ This shows that the update does not correspond to any consistent objective function. Rather, it corresponds to an "evolving objective function" where the effective regularization weight $\hat{\lambda}$ changes through time, and depends on the path the optimization takes, i.e. $E[g^2]$. (Note: It appears you have a sign change in the last formula ... did you mean to have a $-\lambda$ perhaps?)

Most "momentum" techniques will try to preserve the stationary points of the objective function (which may include penalty terms). For your question 2, I would say the standard approach is simply to add the penalty term to the objective function, so that it shows up in the gradient $g$ automatically (and then RMSprop, or whatever method, will incorporate it into $E[g^2]$).

For your question 1, I would say that you are changing the penalty, so it is definitely not standard penalty-term regularization (which would change the objective function). It actually appears more similar to the Levenberg-Marquardt algorithm for nonlinear least squares, in that the "regularization" goes to zero as $E[g^2]$ goes to zero. (However there, I believe the averaging would always be over "all the data", so not path dependent.)