If you are interested in pursuing neural networks (or other machine learning), you would be well served to spend some time on the mathematical fundamentals. I started skimming this trendy new treatise, which IMO is not too difficult mathematically (although I have seen some complain otherwise, and I was a math major, so YMMV!).
That said, I will try to address your questions as best I can at a high level.
- Can L1, L2 and Tikhonov calculations be appended as extra terms directly to a cost function, such as MSE?
Yes, these are all penalty methods which add terms to the cost function.
- If #1 is correct, is it possible to apply them in supervised learning situations where there are no cost functions?
I am not sure what you are thinking here, but in my mind the supervised case is the most clear in terms of cost functions and regularization. In general you can think of the error conceptually as
$$\text{Total Error}=\left(\text{ Data Error }\right) + \left(\text{ Prior Error }\right)$$
where the first term penalizes the misfit of the model predictions vs. the training data, while the second term penalizes overfitting of the model (with the goal of lowering generalization errors).
For a problem where features $x$ are used to predict data $y$ via a function $f(x,w)$ parameterized by weights $w$, the above cost function will typically be realized mathematically by something like
$$E[w]=\|\,f(x,w)-y\,\|_p^p + \|\Lambda w\|_q^q$$
Here the terms have the same interpretation as above, with the errors measured by "$L_p$ norms": The data misfit usually has $p=2$, which corresponds to MSE, while the regularization term may use $q=2$ or $q=1$. (The $\Lambda$ I will get to below.)
The latter corresponds to absolute(-value) errors rather than square errors, and is commonly used to promote sparsity in the solution vector $w$ (i.e. many 0 weights, effectively limiting the connectivity paths). The $L_1$ norm can be used for the data misfit term also, typically to reduce sensitivity to outliers in the data $y$. (Conceptually, predictions will target the median of $y$ rather than the mean of $y$.)
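To make the cost function above concrete, here is a minimal numpy sketch, assuming a linear model $f(x,w)=Xw$ and a scalar Tikhonov matrix $\Lambda=\lambda I$ (both assumptions for illustration; the function names are mine):

```python
import numpy as np

def cost(w, X, y, lam, p=2, q=2):
    """E[w] = ||X @ w - y||_p^p + ||lam * w||_q^q, with Lambda = lam * I."""
    resid = X @ w - y
    data_err = np.sum(np.abs(resid) ** p)    # data misfit term
    prior_err = np.sum(np.abs(lam * w) ** q) # regularization (prior) term
    return data_err + prior_err

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true  # noiseless data, so the misfit term vanishes at w_true

# MSE + ridge penalty (p=q=2) vs. MSE + sparsity-promoting penalty (q=1)
print(cost(w_true, X, y, lam=0.1, p=2, q=2))
print(cost(w_true, X, y, lam=0.1, p=2, q=1))
```

Swapping `q=2` for `q=1` is exactly the move from squared-weight to absolute-value regularization discussed above.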
Note that in many unsupervised learning scenarios there are effectively two "machines", with each taking turns as "data" vs. "prediction" (e.g. the coder and decoder parts of an autoencoder).
- I've gathered that L2 is a special case of Tikhonov Regularization ...
The two can be used as synonyms. Some communities tend to use "$L_2$" to refer to the special case where the Tikhonov matrix $\Lambda$ is simply a scalar $\lambda$. Terminology varies quite a bit.
- L2 is differentiable and therefore compatible with gradient descent ... however, it appears that we can substitute other operations like difference and Fourier operators to derive other brands of Tikhonov Regularization that are not equivalent to L2.
This is not correct. Tikhonov regularization always uses the $L_2$ norm, so it is always a differentiable $L_2$ regularization. The matrix $\Lambda$ does not affect this: it is constant, and it does not matter whether it is a scalar, a diagonal covariance, a finite-difference operator, a Fourier transform, etc.
For differentiability, the comparison is typically between $L_2$ and $L_1$. You can think of it like
$$(x^2)'=2x \quad \text{ vs. } \quad |x|'=\mathrm{sgn}(x)$$
i.e. the $L_2$ norm has a continuous derivative while the $L_1$ norm has a discontinuous derivative.
This difference is not as big as you might imagine in terms of optimize-ability. For example the popular ReLU activation function also has a discontinuous derivative, i.e.
$$\max(0,x)'=(x>0)$$
More importantly, the $L_1$ regularizer is convex, so it helps suppress local minima in the composite cost function. (Note I am avoiding technical issues around convex optimization & differentiability, such as "subgradients", as these are not essential to the point.)
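The derivative comparison above can be checked numerically; this small sketch (function names are mine) evaluates the three derivatives near zero to show which are continuous there:

```python
import numpy as np

def grad_l2(x):
    return 2.0 * x                  # (x^2)' = 2x: continuous through 0

def grad_l1(x):
    return np.sign(x)               # |x|' = sgn(x): jumps from -1 to +1 at 0

def grad_relu(x):
    return (x > 0).astype(float)    # max(0,x)' = (x > 0): jumps from 0 to 1 at 0

xs = np.array([-1.0, -1e-6, 1e-6, 1.0])
print(grad_l2(xs))    # smooth transition through zero
print(grad_l1(xs))    # discontinuous at zero
print(grad_relu(xs))  # also discontinuous at zero, like L1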
- If #4 is true, can someone post an example of a Tikhonov formula that is not equivalent to L2, yet would be useful in neural nets?
This was answered above: No, because all Tikhonov regularizers use the $L_2$ norm. (Whether they are useful for NNs or not!)
- There are arcane references in the literature to a "Tikhonov matrix," which I can't seem to find a definition of (although I've run across several unanswered questions scattered across the Internet about its meaning). An explanation of its meaning would be helpful.
This is simply the matrix $\Lambda$. There are many forms that it can take, depending on the goals of the regularization:
In the simplest case it is simply a constant ($\lambda>0$). This penalizes large weights (i.e. promotes $w_i^2\approx 0$ on average), which can be useful to prevent over-fitting.
In the next simplest case, $\Lambda$ is diagonal, which allows per-weight regularization (i.e. $\lambda_iw_i^2\approx 0$). For example the regularization might vary with level in a deep network.
Many other forms are possible, so I will end with one example of a sparse but non-diagonal $\Lambda$ that is common: A finite difference operator. This only makes sense when the weights $w$ are arranged in some definite spatial pattern. In this case $\Lambda w\approx \nabla w$, so the regularization promotes smoothness (i.e. $\nabla w\approx 0$). This is common in image processing (e.g. tomography), but could conceivably be applied in some types of neural-network architectures as well (e.g. ConvNets).
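Here is a small sketch of that last case, building a first-order finite-difference operator as the Tikhonov matrix $\Lambda$ (a hypothetical construction of mine for illustration) and showing that the $L_2$ penalty $\|\Lambda w\|_2^2$ is small for smooth weight vectors and large for oscillating ones:

```python
import numpy as np

def diff_matrix(n):
    """Forward-difference operator Lambda: (Lambda w)_i = w[i+1] - w[i]."""
    L = np.zeros((n - 1, n))
    for i in range(n - 1):
        L[i, i] = -1.0
        L[i, i + 1] = 1.0
    return L

L = diff_matrix(6)
w_smooth = np.linspace(0.0, 1.0, 6)               # smoothly varying weights
w_rough = np.array([0., 1., 0., 1., 0., 1.])      # oscillating weights

print(np.sum((L @ w_smooth) ** 2))  # small smoothness penalty
print(np.sum((L @ w_rough) ** 2))   # large smoothness penalty
```

Minimizing a cost with this penalty therefore drives $\nabla w \approx 0$, i.e. spatially smooth weights.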
In the last case, note that this type of "Tikhonov" matrix $\Lambda$ can really be used in $L_1$ regularization as well. A nice example from my own experience with the smoothing case, in an image-processing context: I had a least-squares ($L_2$) cost function set up to do denoising, and by literally just putting the same matrices into a "robust least squares" solver* I got an edge-preserving smoother "for free"! (*Essentially this just changed the $p$'s and $q$'s in my first equation from 2 to 1.)
I hope this answer has helped you to understand regularization better!
I would be happy to expand on any part that is still not clear.
Smooth L1-loss can be interpreted as a combination of L1-loss and L2-loss. It behaves as L1-loss when the absolute value of the argument is high, and it behaves like L2-loss when the absolute value of the argument is close to zero. The equation is:
$$L_{1;\text{smooth}}(x) = \begin{cases}|x| & \text{if } |x| > \alpha \\
\frac{1}{\alpha}x^2 & \text{if } |x| \leq \alpha\end{cases}$$
$\alpha$ is a hyper-parameter here and is usually taken as 1. The factor $\frac{1}{\alpha}$ in front of the $x^2$ term makes the loss continuous at $|x|=\alpha$.
Smooth L1-loss combines the advantages of L1-loss (steady gradients for large values of $x$) and L2-loss (less oscillations during updates when $x$ is small).
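A direct numpy implementation of the piecewise formula above (the function name `smooth_l1` is mine):

```python
import numpy as np

def smooth_l1(x, alpha=1.0):
    """Linear in the tails (|x| > alpha), quadratic near zero (|x| <= alpha)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > alpha, np.abs(x), x**2 / alpha)

# linear for large |x|, quadratic for small |x|
print(smooth_l1(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))
```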
Another form of smooth L1-loss is Huber loss. They achieve the same thing. Taken from Wikipedia, Huber loss is
$$
L_\delta (a) = \begin{cases}
\frac{1}{2}a^2 & \text{for } |a| \le \delta, \\
\delta \left(|a| - \frac{1}{2}\delta\right) & \text{otherwise.}
\end{cases}
$$
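The Huber formula translates to numpy the same way (the function name `huber` is mine):

```python
import numpy as np

def huber(a, delta=1.0):
    """0.5*a^2 for |a| <= delta, else delta*(|a| - 0.5*delta)."""
    a = np.asarray(a, dtype=float)
    quad = 0.5 * a**2
    lin = delta * (np.abs(a) - 0.5 * delta)
    return np.where(np.abs(a) <= delta, quad, lin)

# quadratic inside [-delta, delta], linear outside
print(huber(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))
```

Note that with the $\frac{1}{2}$ factors, Huber loss is not only continuous at $|a|=\delta$ but also has a continuous first derivative there.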
Best Answer
Slight addendum to my previous answer: The more I think about what you have written, the more I get the feeling you have made a mistake somewhere. Let $L(x,W)$ be the loss function and $R(W,\lambda)$ be the regularization (e.g. $R(W,\lambda)=\lambda\|W\|_2^2$ for the 2-norm). Since $R$ is increasing in $\lambda$, for $\lambda_{1} < \lambda_{2}$ and any $W$ we have
$$R(W,\lambda_{2}) \geq R(W,\lambda_{1}).$$
Therefore, given a *fixed* input $x$,
$$L(x,W) + R(W,\lambda_{2}) \geq L(x,W) + R(W,\lambda_{1}).$$
Therefore
$$L(x,W) + R(W,\lambda_{2}) \geq \inf_{W}\left[\,L(x,W) + R(W,\lambda_{1})\,\right].$$
In particular this holds at the $W$ that is optimal for $\lambda_{2}$, so the optimal regularized loss absolutely has to be a non-decreasing function of $\lambda$. Please check your code; either there is a mistake or you have not chosen the starting points properly.
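The monotonicity argument above is easy to verify numerically. This sketch (my own construction, using ridge regression because its minimizer has a closed form) evaluates the optimal regularized objective for increasing $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

def optimal_objective(lam):
    """min_w ||X w - y||^2 + lam ||w||^2 via the normal equations."""
    n = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

vals = [optimal_objective(lam) for lam in (0.0, 0.1, 1.0, 10.0)]
print(vals)  # non-decreasing in lambda, as the inequality chain predicts
```

If a training run shows the optimal loss *decreasing* as $\lambda$ grows, something is off in the code or the optimizer is stuck at a poor starting point.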