Solved – L2 Regularization Constant

neural networks, regularization

When implementing a neural net (or other learning algorithm), we often want to regularize our parameters $\theta_i$ via L2 regularization. We usually do this by adding a regularization term to the cost function, where $m$ is the number of training examples and $n$ is the number of parameters, like so:
$$\text{cost}=\frac{1}{m}\sum_{i=1}^m\text{loss}_i +\frac{\lambda}{2m}\sum_{i=1}^n (\theta_i)^2$$
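For concreteness, here is a minimal NumPy sketch of that cost. The function name `l2_regularized_cost` and its arguments are hypothetical, and it assumes `losses` holds one loss value per training example (so $m$ is its length) and `theta` is a flat array of all $n$ parameters.

```python
import numpy as np

def l2_regularized_cost(losses, theta, lam):
    """Mean per-example loss plus an L2 penalty scaled by lam / (2m).

    Assumption: `losses` has one entry per training example and
    `theta` is a flat array holding every regularized parameter.
    """
    losses = np.asarray(losses, dtype=float)
    theta = np.asarray(theta, dtype=float)
    m = losses.size                                # number of training examples
    data_term = losses.sum() / m                   # (1/m) * sum of per-example losses
    penalty = lam / (2 * m) * np.sum(theta ** 2)   # (lambda / 2m) * sum of theta_i^2
    return data_term + penalty

# Two examples, two parameters, lambda = 0.1
example_cost = l2_regularized_cost([0.5, 1.5], [1.0, -2.0], lam=0.1)
```

Note that the penalty is summed over the parameters but divided only by $m$, not by $n$.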

We then proceed to minimize this cost function and hopefully when we've reached a minimum, we get a model which has less overfitting than a model without regularization. As far as I know, this is the L2 regularization method (and the one implemented in deep learning libraries). Let me know if I have made any errors.

My question is this: since nothing in the regularization term accounts for the total number of parameters in the model, it seems to me that the more parameters there are, the larger that second term will naturally be. If a model has 300 million parameters, for example, and I set $\lambda=1$, that second term might be huge. Is it standard practice, then, to reduce $\lambda$ in some way to account for the massive number of parameters in the model, or is it OK to simply accept starting off with a huge cost?

It seems to me that if we do not somehow scale $\lambda$ inversely with the number of parameters, then using a huge number of parameters while keeping $\lambda$ constant gives a much stronger regularization effect: the second term will dominate the first and push the parameters toward $\theta_i \approx 0$ much more forcefully. I have not seen any mention of doing this in any of the resources I've come across, though, so I wonder if my analysis is fundamentally wrong somewhere.
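To make the scaling concern concrete, here is a small sketch that evaluates only the second term for two model sizes. The helper `l2_penalty`, the fixed weight magnitude of 0.1, and the choice of $m=1000$ are illustrative assumptions, not values from any real network.

```python
import numpy as np

def l2_penalty(theta, lam, m):
    """Second term of the cost: (lambda / 2m) * sum of squared parameters."""
    return lam / (2 * m) * float(np.sum(np.square(theta)))

m = 1000        # assumed number of training examples
lam = 1.0       # the lambda value from the question
weight = 0.1    # assumed typical magnitude of every parameter

# Same lambda and same weight magnitude, different parameter counts
penalties = {
    n: l2_penalty(np.full(n, weight), lam, m)
    for n in (1_000, 1_000_000)
}
ratio = penalties[1_000_000] / penalties[1_000]
```

With every weight held at the same magnitude, the penalty grows in direct proportion to the number of parameters, so `ratio` equals the ratio of the two parameter counts.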

Best Answer

You are absolutely right in your observation that the number of parameters will affect the regularization cost.

I don't think there are any rule-of-thumb values for $\lambda$ (but $\lambda=1$ would be considered large). If cross-validation is too time-consuming, you could hold out part of the training data and tune $\lambda$ using early stopping. You would still need to try several values for $\lambda$; common practice is to try something like $0.01, 0.02,\ldots,0.4$.
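As a rough sketch of the hold-out idea, the snippet below tries a few values of $\lambda$ and keeps the one with the lowest error on the held-out part. To stay short it swaps in a linear model with a closed-form L2-regularized fit in place of a neural net, and it does not implement early stopping. The synthetic data, the function names, and the candidate grid are all assumptions made for illustration, not the exact sequence suggested above.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/2m)*||X theta - y||^2 + (lam/2m)*||theta||^2 in closed form."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def holdout_mse(X, y, theta):
    """Mean squared error on held-out data (no penalty term)."""
    return float(np.mean((X @ theta - y) ** 2))

def tune_lambda(X_train, y_train, X_val, y_val, candidates):
    """Fit once per candidate lambda and score each fit on the hold-out set."""
    scores = {}
    for lam in candidates:
        theta = fit_ridge(X_train, y_train, lam)
        scores[lam] = holdout_mse(X_val, y_val, theta)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic regression problem (assumption: stands in for real training data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_theta = rng.normal(size=50)
y = X @ true_theta + rng.normal(scale=0.5, size=200)

# Hold out the last 50 examples for tuning
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

candidates = [0.01, 0.03, 0.1, 0.3, 1.0]   # illustrative grid
best_lam, scores = tune_lambda(X_train, y_train, X_val, y_val, candidates)
```

For a neural net, the same loop would wrap a full training run per candidate, which is why the search gets expensive for large models.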

For really large networks, it might be more convenient to use other regularization methods, like dropout, instead of $\ell_2$ regularization.