Solved – Should L2 regularization be corrected for scale?

deep learning, regularization

I am training a deep neural network using cross entropy loss and L2 regularization, so the final cost function looks something like this:
$$E = -\frac{1}{N_{samples}} \sum_{i=1}^{N_{samples}} \text{cross\_entropy}\left(x_i, y_i\right) + \lambda \sum_{j=1}^{N_{layers}}\sum_{k=1}^{N^j_{units}}\sum_{l=1}^{N^{j+1}_{units}} \left(w^j_{k,l}\right)^2$$
where the first term is the cross entropy over classes (averaged over the size of the training set) and the second term is the sum of the squared weights in the network ($w^j_{k,l}$ is the weight from the $k$-th unit in the $j$-th layer to the $l$-th unit in the $(j+1)$-th layer), and $\lambda$ is a regularization strength parameter.
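For concreteness, a minimal sketch of this cost in PyTorch might look like the following (the function name `l2_regularized_loss` and the argument `lam` are just illustrative, not from any particular codebase):

```python
import torch
import torch.nn.functional as F

def l2_regularized_loss(model, logits, targets, lam):
    """Mean cross-entropy over the batch plus lam * sum of squared weights."""
    # First term: F.cross_entropy returns the negative log-likelihood averaged
    # over the batch, i.e. the first term of E above (up to sign convention).
    ce = F.cross_entropy(logits, targets)

    # Second term: sum over all layers of the squared weights w^j_{k,l}
    # (biases are typically left out of the penalty).
    l2 = sum((p ** 2).sum() for name, p in model.named_parameters()
             if name.endswith("weight"))

    return ce + lam * l2
```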

My question is: won't the number of layers and units affect the scale of the regularization term? Wouldn't it therefore make more sense to normalize the second term by the number of weights, i.e., to replace $\lambda$ with $\frac{\lambda}{N_{layers}\,N^j_{units}\,N^{j+1}_{units}}$?
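A sketch of what that rescaling could look like (purely illustrative; it divides $\lambda$ by the total number of weights, which is what your denominator counts when all layers have the same width):

```python
def scale_free_lambda(model, lam):
    """Hypothetical helper: divide lam by the total number of weights so the
    penalty term does not grow with the depth or width of the network."""
    n_weights = sum(p.numel() for name, p in model.named_parameters()
                    if name.endswith("weight"))
    return lam / n_weights
```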

Unfortunately, I have not found any reference on this. The only related advice I found is in Bengio's paper [1] (weight decay subsection), which recommends scaling according to the number of mini-batches in each epoch (and I do not really see why).

[1] Y. Bengio, "Practical Recommendations for Gradient-Based Training of Deep Architectures", 2012.

Best Answer

Recall where this term actually comes from: it sets the amount of weight decay we want to apply to each weight at every iteration:

$$\frac{\partial E}{\partial w^j_{k,l}} = \frac{\partial \left(\text{cross-entropy term}\right)}{\partial w^j_{k,l}} + 2\lambda\, w^j_{k,l}$$

Say $\lambda$ is 0.001: essentially, it means that if the error is not affected by this particular weight, the weight is decayed by a fixed fraction of itself ($2\lambda$, i.e. 0.2%, times the learning rate) at each update. The number of units and layers does not affect the decay we want for a particular weight.
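As a sanity check on that claim, here is the single-weight SGD update written out with made-up numbers:

```python
# SGD update for one weight: w <- w - eta * (dE_ce/dw + 2 * lam * w).
# If the cross-entropy gradient for this weight happens to be zero, the update
# reduces to w <- w * (1 - 2 * eta * lam): a fixed fractional shrinkage that is
# independent of how many layers or units the network has.
eta, lam = 1.0, 0.001
w, d_ce_dw = 0.5, 0.0            # hypothetical weight, no error gradient
w_new = w - eta * (d_ce_dw + 2 * lam * w)
print(w_new / w)                 # 0.998, i.e. a 0.2% decay on this step
```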

In his paper, I don't think he normalizes the loss by the number of samples as you do; rather, he adds a $1/n_{samples}$ factor to the gradient, where $n_{samples}$ is the mini-batch size. So effectively $\lambda_{his} = \lambda_{yours} \cdot n_{samples}$. As you correctly said, $\lambda_{yours}$ does not depend on the mini-batch size, but $\lambda_{his}$ does, so he assumes $\lambda$ is the regularization parameter he would use for full-batch gradient descent and scales it by $n_{samples} / (\text{training set size})$ when doing it with mini-batches.
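To make the bookkeeping concrete, here is the correspondence above with made-up numbers (the set sizes and $\lambda$ values are purely illustrative):

```python
N_train, batch_size = 50_000, 100      # hypothetical training set and mini-batch sizes

# If your loss is averaged over the mini-batch while his regularization gradient
# carries a 1/n_samples factor, the two coefficients relate as:
lambda_yours = 1e-4
lambda_his = lambda_yours * batch_size                        # = 1e-2

# Bengio's recommendation, as read above: choose lambda for full-batch gradient
# descent, then rescale it by (mini-batch size / training set size) for mini-batches.
lambda_full_batch = 5e-3
lambda_minibatch = lambda_full_batch * batch_size / N_train   # = 1e-5
print(lambda_his, lambda_minibatch)
```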