For straight SGD, you have
$$\theta_{t+1} = \theta_t - \eta g_t$$
and $\eta$ is the "learning rate" (a.k.a. step size).
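This update is easy to state in code. A minimal sketch of one SGD step on a toy quadratic objective (the objective and all names here are illustrative, not from any library):

```python
import numpy as np

# Toy objective f(theta) = ||theta||^2 / 2, so the gradient is g = theta.
eta = 0.1  # learning rate (step size)

theta = np.array([1.0, -2.0])
g = theta                      # gradient of the toy objective at theta
theta_next = theta - eta * g   # SGD update: theta_{t+1} = theta_t - eta * g_t
```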

In principle, any gradient descent method, including SGD, is trying to find a stationary point of the objective function $f[\theta]$, where $g=\partial_{\theta}f$ (and hopefully a local minimum, rather than a maximum or saddle). It is easy to see that
$$g_t=0 \implies \theta_{t+1} = \theta_t$$
so the SGD update is consistent with this (as is the RMSprop update you cite).

When you give the regularized update equation
$$\theta_{t+1} = \theta_t - \eta (g_t + 2\lambda\theta_t)$$
notice how
$$\theta_{t+1} = \theta_t \implies g_t + 2\lambda\theta_t = 0$$
So the stationary point no longer corresponds to $g=0$, but rather corresponds to
$$\hat{g} = g + 2\lambda\theta = \partial_{\theta}(f+\lambda\theta^2) = 0$$
i.e. the gradient of the regularized objective function, which includes a penalty term. (This also applies to the first "regularized RMSprop update" formula you give.)

Your "adjusted RMSprop" update equation
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t+2\eta\lambda\theta_t$$
corresponds to a stationary point
$$\theta_{t+1} = \theta_t \implies g_t - 2\left(\lambda\sqrt{E[g^2]_t+\epsilon}\right)\theta_t = 0$$
This shows that the update does not correspond to any consistent objective function. Rather, it corresponds to an "evolving objective function" where the effective regularization weight $\hat{\lambda}$ changes through time, and depends on the *path* the optimization takes, i.e. $E[g^2]$. (Note: It appears you have a sign change in the last formula ... did you mean to have a $-\lambda$ perhaps?)
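To make the path dependence concrete, here is a sketch (hypothetical gradient histories, RMSprop-style decay rate `rho` assumed) showing that the effective penalty weight $\hat{\lambda} = \lambda\sqrt{E[g^2]_t+\epsilon}$ differs for two optimization paths even when the current gradient is the same:

```python
import numpy as np

lam, eps, rho = 0.01, 1e-8, 0.9

def running_avg(grads, rho=rho):
    """RMSprop-style exponential moving average of squared gradients."""
    Eg2 = 0.0
    for g in grads:
        Eg2 = rho * Eg2 + (1 - rho) * g**2
    return Eg2

# Two gradient histories ending with the same current gradient...
path_a = [0.1, 0.1, 0.1]
path_b = [5.0, 3.0, 0.1]

lam_hat_a = lam * np.sqrt(running_avg(path_a) + eps)
lam_hat_b = lam * np.sqrt(running_avg(path_b) + eps)

# ...yield different effective regularization weights.
assert lam_hat_b > lam_hat_a
```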

Most "momentum" techniques will try to preserve the stationary points of the objective function (which may include penalty terms). For your question 2, I would say the standard approach is simply to add the penalty term to the objective function, so that it shows up in the gradient $g$ automatically (and then RMSprop, or whatever method, will incorporate it into $E[g^2]$).
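A sketch of that standard approach, again on a toy quadratic objective: the penalized gradient $\hat{g} = g + 2\lambda\theta$ flows through RMSprop's accumulator $E[g^2]$ like any other gradient, so the stationary points of the penalized objective are preserved.

```python
import numpy as np

eta, lam, rho, eps = 0.01, 0.01, 0.9, 1e-8

theta = np.array([1.0, -2.0])
Eg2 = np.zeros_like(theta)

for _ in range(5):
    g = theta                        # gradient of toy f(theta) = ||theta||^2 / 2
    g_hat = g + 2 * lam * theta      # gradient of the penalized objective
    Eg2 = rho * Eg2 + (1 - rho) * g_hat**2          # accumulator sees g_hat
    theta = theta - eta * g_hat / np.sqrt(Eg2 + eps)  # RMSprop step
```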

For your question 1, I would say that you *are* changing the penalty, so it is definitely *not* standard penalty-term regularization (which would change the objective function). It actually appears more similar to the Levenberg-Marquardt algorithm for nonlinear least squares, in that the "regularization" goes to zero as $E[g^2]$ goes to zero. (However there, I believe the averaging would always be over "all the data", so not path dependent.)

In general, there aren't definitive results on one learning algorithm being "better" than another. The common wisdom (which needs to be taken with a pound of salt) has been that Adam requires less experimentation to get convergence on the first try than SGD and variants thereof. However, this is highly dataset/model dependent. More recently some groups have made the claim that despite training faster, models trained with Adam generalize worse to the validation or test sets. See https://arxiv.org/abs/1712.07628 for more information.

## Best Answer

After researching a few articles online and the Keras documentation, it seems the RMSprop optimizer is recommended for recurrent neural networks: https://github.com/keras-team/keras/blob/master/keras/optimizers.py#L209

Stochastic Gradient Descent seems to take advantage of its learning rate and momentum between each batch to optimize the model's weights based on the information from the loss function (in my case, 'categorical_crossentropy').
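The momentum mechanism mentioned above can be sketched in a few lines (classical momentum on a toy quadratic objective; the coefficients are illustrative, not Keras defaults):

```python
import numpy as np

eta, mu = 0.1, 0.9    # learning rate and momentum coefficient

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)

for _ in range(3):
    g = theta               # gradient of the toy objective ||theta||^2 / 2
    v = mu * v + eta * g    # velocity blends past gradients with the new one
    theta = theta - v       # step along the accumulated velocity
```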

I suggest http://ruder.io/optimizing-gradient-descent/index.html for additional information about optimization algorithms.