Why does L2 regularization smooth the loss surface?

neural networks · optimization · regularization

Fitting neural nets with L2 penalization, I've noticed that I often attain lower in-sample mean-squared errors with higher rates of L2 "weight decay" than I do with lower rates of L2 weight decay.

Say I train a network to convergence using a small $\lambda$, say $2^{-8}$, save the weights, and then use them to initialize a network with $\lambda = 2^{-7}$, and so on, moving upward until $\lambda$ gets so big that all the weights go to zero. A sketch of the procedure follows.
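For concreteness, here is a minimal sketch of the sweep I have in mind (PyTorch; the data, architecture, and training settings below are placeholders, not my exact setup):

```python
# Warm-started lambda sweep: each fit starts from the weights left by the
# previous, smaller lambda. Data and architecture are hypothetical.
import torch
import torch.nn as nn

def train_to_convergence(model, X, y, lam, epochs=500, lr=1e-2):
    """Fit `model` to (X, y) with L2 penalty `lam`; return final in-sample MSE."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=lam)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss_fn(model(X), y).item()

torch.manual_seed(0)
X = torch.randn(200, 10)          # placeholder inputs
y = torch.randn(200, 1)           # placeholder targets

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
for k in range(-8, 0):            # lambda = 2^-8, 2^-7, ..., 2^-1
    lam = 2.0 ** k
    mse = train_to_convergence(model, X, y, lam)
    print(f"lambda = {lam:.6f}  in-sample MSE = {mse:.4f}")
```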

One would expect in-sample MSE to increase monotonically. But it doesn't.

My guess is that this reflects the optimizer being more prone to getting trapped in local minima formed by "crevasses" along a few dimensions. Is that the case?

Is this phenomenon general, and therefore amenable to formalization?

Best Answer

Slight addendum to my previous answer: the more I think about what you have written, the more I get the feeling you have made a mistake somewhere. Let $L(x,W)$ be the loss function and $R(W,\lambda) = \lambda \lVert W \rVert_2^2$ be the L2 regularization term. For $\lambda_{1} < \lambda_{2}$ and any weights $W$, we have

$R(W,\lambda_{2}) \geq R(W,\lambda_{1})$

Therefore, given a *fixed* input $x$,

$L(x,W) + R(W,\lambda_{2}) \geq L(x,W) + R(W,\lambda_{1})$.

In particular, take $W$ to be the minimizer of the $\lambda_{2}$-regularized objective, so that the left-hand side is the optimal objective value at $\lambda_{2}$; since the infimum is a lower bound on the right-hand side,

$L(x,W) + R(W,\lambda_{2}) \geq \inf_{W'} \left[ L(x,W') + R(W',\lambda_{1}) \right].$

Therefore the optimal value of the regularized objective absolutely has to be a monotone (nondecreasing) function of $\lambda$. Please check your code; either there is a mistake somewhere, or you have not chosen the starting points properly.
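If it helps, here is a small numerical illustration of the argument, a sketch using ridge regression (where the minimizer is available in closed form) with synthetic placeholder data:

```python
# Verify that the optimal penalized objective is nondecreasing in lambda.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # placeholder design matrix
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)   # placeholder targets

def optimal_objective(lam):
    """Penalized objective L + R evaluated at the exact ridge minimizer."""
    n, d = X.shape
    W = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    mse = np.mean((X @ W - y) ** 2)
    return mse + lam * np.sum(W ** 2)

lams = [2.0 ** k for k in range(-8, 1)]
objs = [optimal_objective(lam) for lam in lams]
# Nondecreasing in lambda, as the inequality chain above requires:
assert all(a <= b + 1e-12 for a, b in zip(objs, objs[1:]))
```

The convex case is only an illustration, of course, but the inequality chain above does not use convexity anywhere; it holds for the infimum of any loss.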
