L2 Regularization – Why It Is Equivalent to Gaussian Prior

Tags: references, regression, regularization

I keep reading this, and intuitively I can see it, but how does one go analytically from L2 regularization to saying that it corresponds to a Gaussian prior? The same goes for saying that L1 is equivalent to a Laplace prior.

Any further references would be great.

Best Answer

Let us imagine that you want to infer some parameter $\beta$ from some observed input-output pairs $(x_1,y_1),\dots,(x_N,y_N)$. Let us assume that the outputs are linearly related to the inputs via $\beta$ and that the data are corrupted by some noise $\epsilon$:

$$y_n = \beta x_n + \epsilon,$$

where $\epsilon$ is Gaussian noise with mean $0$ and variance $\sigma^2$. This gives rise to a Gaussian likelihood:

$$\prod_{n=1}^N \mathcal{N}(y_n|\beta x_n,\sigma^2).$$
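For concreteness, here is a minimal Python sketch of this generative model and its likelihood (the particular values of $\beta$, $\sigma$ and $N$ below are arbitrary choices for illustration, not from the question):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative values (my own choices): true beta, noise level, sample size
beta_true, sigma, N = 2.0, 0.5, 50

# Simulate y_n = beta * x_n + eps_n with eps_n ~ N(0, sigma^2)
x = rng.uniform(-1.0, 1.0, size=N)
y = beta_true * x + rng.normal(0.0, sigma, size=N)

def log_likelihood(beta):
    """log of prod_n N(y_n | beta * x_n, sigma^2)."""
    return norm.logpdf(y, loc=beta * x, scale=sigma).sum()

print(log_likelihood(beta_true), log_likelihood(0.0))  # larger near the true beta
```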

Let us regularise the parameter $\beta$ by imposing the Gaussian prior $\mathcal{N}(\beta|0,\lambda^{-1}),$ where $\lambda$ is a strictly positive scalar ($\lambda$ quantifies how strongly we believe that $\beta$ should be close to zero, i.e. it controls the strength of the regularisation). Hence, combining the likelihood and the prior we simply have:

$$\prod_{n=1}^N \mathcal{N}(y_n|\beta x_n,\sigma^2) \mathcal{N}(\beta|0,\lambda^{-1}).$$

Let us take the logarithm of the above expression. Dropping additive constants and multiplying through by $2$ (which does not change where the maximum lies), we get:

$$-\frac{1}{\sigma^2}\sum_{n=1}^N (y_n-\beta x_n)^2 - \lambda \beta^2 + \mbox{const}.$$
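Spelling the equivalence out: maximising this expression is the same as minimising its negative, and multiplying through by the positive constant $\sigma^2$ does not move the optimum, so

$$\arg\max_{\beta}\left[-\frac{1}{\sigma^2}\sum_{n=1}^N (y_n-\beta x_n)^2 - \lambda\beta^2\right] = \arg\min_{\beta}\left[\sum_{n=1}^N (y_n-\beta x_n)^2 + \lambda\sigma^2\beta^2\right],$$

which is exactly least squares with an L2 (ridge) penalty of strength $\lambda\sigma^2$.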

The value of $\beta$ that maximises this log posterior is the so-called maximum a posteriori estimate of $\beta$, or MAP estimate for short. In this expression it becomes apparent why the Gaussian prior can be interpreted as an L2 regularisation term.
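As a quick numerical sanity check (a sketch I am adding on top of the answer, with illustrative parameter values), one can maximise this log posterior numerically and compare the result with the closed-form solution of the equivalent ridge problem $\min_\beta \sum_n (y_n-\beta x_n)^2 + \lambda\sigma^2\beta^2$; the two estimates coincide:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative values (my own choices): true beta, noise level, prior precision, sample size
beta_true, sigma, lam, N = 2.0, 0.5, 3.0, 50
x = rng.uniform(-1.0, 1.0, size=N)
y = beta_true * x + rng.normal(0.0, sigma, size=N)

def neg_log_posterior(beta):
    """-(log likelihood + log Gaussian prior N(beta | 0, 1/lambda)), constants included."""
    log_lik = norm.logpdf(y, loc=beta * x, scale=sigma).sum()
    log_prior = norm.logpdf(beta, loc=0.0, scale=np.sqrt(1.0 / lam))
    return -(log_lik + log_prior)

# MAP estimate obtained by numerical optimisation of the log posterior
beta_map = minimize_scalar(neg_log_posterior).x

# Ridge estimate: argmin_beta sum (y_n - beta x_n)^2 + (lam * sigma^2) beta^2 (closed form)
beta_ridge = (x @ y) / (x @ x + lam * sigma ** 2)

print(beta_map, beta_ridge)  # agree up to optimiser tolerance
```

The ridge penalty that reproduces the MAP estimate is $\lambda\sigma^2$, which is why the prior precision $\lambda$ and the noise variance $\sigma^2$ jointly set the effective amount of regularisation.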


The relationship between the L1 norm and the Laplace prior can be understood in the same fashion: instead of a Gaussian prior, multiply the likelihood by a Laplace prior and then take the logarithm.
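Concretely, take a Laplace prior with density $\frac{1}{2b}\exp(-|\beta|/b)$, where the scale $b$ plays the role of $\lambda^{-1}$ above. Repeating the same steps (take the logarithm, drop additive constants, multiply by $2$) gives

$$-\frac{1}{\sigma^2}\sum_{n=1}^N (y_n-\beta x_n)^2 - \frac{2}{b}|\beta| + \mbox{const},$$

so the MAP estimate now minimises $\sum_{n=1}^N (y_n-\beta x_n)^2 + \frac{2\sigma^2}{b}|\beta|$, i.e. an L1 (lasso) penalised least-squares problem.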


A good reference (perhaps slightly advanced) detailing both issues is the paper "Adaptive Sparseness for Supervised Learning", which currently does not seem easy to find online. Alternatively look at "Adaptive Sparseness using Jeffreys Prior". Another good reference is "On Bayesian classification with Laplace priors".