Solved – Can weight decay be higher than the learning rate?

deep learning, neural networks

I am using the Adam optimizer at the moment with a learning rate of 0.001 and a weight decay value of 0.005.

I understand that weight decay reduces the weight values over time and that the learning rate controls how far the weights move in the direction of the gradient. Does it make sense to have a higher weight decay value than learning rate? Or will the weights be pushed toward zero faster than they can learn?

Best Answer

Training a neural network means minimizing some error function, which generally contains two parts: a data term (which penalizes the network for incorrect predictions) and a regularization term (which ensures the network weights satisfy some other assumptions); in our case the regularizer is weight decay, which penalizes weights far from zero. The error function may look like this:

$E=\frac{1}{N}||\mathbf{y}-\mathbf{t}||_2 + \lambda ||w||_2$,

where $\mathbf{y}$ are the network predictions, $\mathbf{t}$ are the desired outputs (ground truth), $N$ is the size of the training set, and $w$ is the vector of the network weights. The parameter $\lambda$ controls the relative importance of the two parts of the error function. Setting a weight decay corresponds to setting this parameter. If you set it to a high value, the network does not care so much about correct predictions on the training set and rather keeps the weights low, hoping for good generalization performance on the unseen data.
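To make the two terms concrete, here is a minimal NumPy sketch that evaluates this error function exactly as written above; the function and variable names are illustrative, not from the original post:

```python
import numpy as np

def error(y, t, w, lam):
    """Error = data term + weight-decay term, following the formula above."""
    n = len(t)                              # size of the training set N
    data_term = np.linalg.norm(y - t) / n   # (1/N) * ||y - t||_2
    decay_term = lam * np.linalg.norm(w)    # lambda * ||w||_2
    return data_term + decay_term

# Example: a larger lambda shifts the balance toward keeping the weights small.
y = np.array([0.9, 0.2, 0.7])   # network predictions
t = np.array([1.0, 0.0, 1.0])   # ground-truth targets
w = np.array([0.5, -1.2, 0.3])  # network weights
print(error(y, t, w, lam=0.005))
```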

How the error function is minimized is an entirely separate matter. You can use a fancy method such as Adam, or you can use plain stochastic gradient descent; both work on the same iterative principle:

  1. Evaluate derivatives of the error function w.r.t. weights: $\frac{\partial E}{\partial w}$

  2. Update weights in the negative direction of the derivatives by a small step.

It can be written down like this:

$w_{t+1} = w_t - \eta \frac{\partial E}{\partial w}$

The parameter $\eta$ is called the learning rate: it controls the size of the step.
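To see how the two hyperparameters act at different places in this update, here is a bare-bones gradient-descent loop in NumPy on a toy linear least-squares problem. It is a sketch, not anything from the original post, and for simplicity it uses the more common squared L2 penalty $\lambda ||w||_2^2$ (whose gradient is just $2\lambda w$) rather than the un-squared norms in the formula above. The weight decay $\lambda$ enters through the gradient of the error, while the learning rate $\eta$ only scales the step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                    # toy inputs
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

eta, lam = 0.001, 0.005                # learning rate and weight decay
w = np.zeros(5)                        # network weights (here: a linear model)

for step in range(10_000):
    y = X @ w                                  # predictions
    grad_data = 2 / len(t) * X.T @ (y - t)     # gradient of the (squared) data term
    grad_decay = 2 * lam * w                   # gradient of lambda * ||w||_2^2
    w -= eta * (grad_data + grad_decay)        # w_{t+1} = w_t - eta * dE/dw

print(w)   # close to the true coefficients even though lam > eta
```

Note that despite `lam` being five times larger than `eta`, the weights still converge to useful values: the learning rate scales the whole step, while the weight decay only tilts the objective toward smaller weights.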


Thus, these two parameters are independent of each other, and in principle it can make sense to set the weight decay larger than the learning rate. In practice, it depends entirely on your specific scenario: which network architecture are you using? How many weights are there? What is the error function? Are you using other regularizers? And so on. It is your job to find the right hyperparameters.
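The question does not say which framework is being used; assuming PyTorch, the configuration described would look roughly like this (the model itself is just a placeholder):

```python
import torch
import torch.nn as nn

# Hypothetical small network; the layer sizes are placeholders, not from the question.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Weight decay (0.005) larger than the learning rate (0.001) is perfectly legal;
# the optimizer treats them as independent hyperparameters.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.005)
```

One detail worth knowing: in PyTorch, `torch.optim.Adam`'s `weight_decay` is implemented as an L2 penalty added to the gradient, whereas `torch.optim.AdamW` applies decoupled weight decay; which behavior you want is itself part of the hyperparameter choice.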