Keras RMSprop – Difference Between Rho and Decay Arguments

keras, optimization, recurrent-neural-network, rms, tensorflow

I am working to tune an RNN for the purposes of predictive analytics on time series data. I am testing different optimizers and am currently working with RMSprop.

I have reviewed Hinton's lecture notes on the subject, as well as the Keras documentation on included optimizers. However, for the life of me, I cannot find a good explanation of the difference between rho and decay.

From these docs, I know that rho also refers to a decay rate, but the difference between the two arguments is anything but clear to me at this point.

keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0)

Best Answer

Short explanation

rho is the "Gradient moving average [also exponentially weighted average] decay factor" and decay is the "Learning rate decay over each update".

Long explanation

RMSProp is defined as follows:

$$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho)\, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

where $g_t$ is the gradient at step $t$, $\eta$ is the learning rate, and $E[g^2]_t$ is the running average of the squared gradients.

So RMSProp uses "rho" to calculate an exponentially weighted average over the square of the gradients.

Note that "rho" is a direct parameter of the RMSProp optimizer (it is used in the RMSProp formula).

Decay, on the other hand, handles learning rate decay. Learning rate decay is a mechanism generally applied independently of the chosen optimizer. Keras simply builds this mechanism into the RMSProp optimizer for convenience (as it does with other optimizers like SGD and Adam, which all have the same "decay" parameter). You may think of the "decay" parameter as "lr_decay".
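
In the Keras 2.x optimizers, decay shrinks the effective learning rate as a function of how many updates have been performed, roughly along the lines of the sketch below (a simplified illustration; the exact expression may differ between Keras versions):

def decayed_learning_rate(initial_lr, decay, iteration):
    # Inverse-time decay: the effective learning rate shrinks as training progresses.
    return initial_lr * (1.0 / (1.0 + decay * iteration))

# With initial_lr=0.001 and decay=1e-4:
#   iteration 0      -> 0.001
#   iteration 1000   -> ~0.000909
#   iteration 10000  -> 0.0005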

It can be confusing at first that there are two decay parameters, but they are decaying different values.

  • "rho" is the decay factor or the exponentially weighted average over the square of the gradients.
  • "decay" decays the learning rate over time, so we can move even closer to the local minimum in the end of training.