Solved – Why is regularization interpreted as a Gaussian prior on the weights

Tags: machine learning, maximum likelihood, normal distribution, posterior, regularization

While going through this online class, I found that the notes state the following (see the highlighted portion):

[Screenshot of the highlighted course notes omitted.]

I understand how we get a maximum likelihood interpretation. What I do not get is:

Why does using the squared Frobenius norm of the weight matrix as the regularizer $R(W)$ lead to the interpretation that the weight matrix $W$ has a Gaussian prior? How is this arrived at? And what if the regularization term were something other than the Frobenius norm?

(And on that last point, does this mean that every element of the matrix $W$ is sampled from a Gaussian?)

Thank you.

Best Answer

Since we're using MAP, we are trying to maximize the probability of the parameters given the data:

$$ P(W|x,y) = \frac {P(x,y|W) P(W)} {P(x,y)} $$

$P(x,y)$ can be ignored since it is fixed for our data, so we are trying to maximize $\log P(x,y \mid W) + \log P(W)$ (the log is monotonic, so it doesn't change the maximizer). Let's look at $P(W)$.

If each element of $W$ is drawn independently from a unit Gaussian, the density of the matrix is

$$ P(W) = \prod_{ij} \frac{1}{\sqrt{2 \pi}} \exp\Big( -\frac{w_{ij}^2}{2} \Big) $$

so $\log P(W)$ is

$$ -\frac{1}{2} \sum_{ij} w_{ij}^2 $$

plus some constant terms.
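As a quick sanity check, here is a small numerical sketch (the matrix shape and the use of NumPy/SciPy are my own illustration, not from the notes) confirming that the log-density of i.i.d. unit Gaussians is exactly this sum of squares plus a constant:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # a hypothetical small weight matrix

# Log-prior computed directly from the unit-Gaussian density.
log_prior = norm.logpdf(W).sum()

# Log-prior from the derived expression: -1/2 * sum of squares,
# plus the constant -n/2 * log(2*pi), with n the number of entries.
derived = -0.5 * np.sum(W**2) - 0.5 * W.size * np.log(2 * np.pi)

assert np.isclose(log_prior, derived)
```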

The log-posterior (up to constants that don't depend on $W$) is then

$$ \log P(x,y \mid W) - \frac{1}{2} \sum_{ij} w_{ij}^2 $$

Since we are thinking in terms of losses, we negate this and minimize. Now the first term is the cross-entropy loss, and the second term is $\frac{1}{2} R(W)$, where $R(W) = \sum_{ij} w_{ij}^2$ is the squared Frobenius norm. (A Gaussian prior with variance $\sigma^2$ on each element would give a regularization strength of $\frac{1}{2\sigma^2}$ instead of $\frac{1}{2}$.)
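To make the correspondence concrete, here is a minimal sketch (the toy data, shapes, and the choice of a softmax classifier are my own assumptions for illustration) in which the negated log-posterior is exactly the cross-entropy loss plus $\frac{1}{2} R(W)$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))     # 5 examples, 4 features (toy data)
y = rng.integers(0, 3, size=5)  # labels for 3 classes
W = rng.normal(size=(4, 3))     # weight matrix

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Negative log-likelihood of the labels = cross-entropy loss (summed).
probs = softmax(X @ W)
cross_entropy = -np.log(probs[np.arange(len(y)), y]).sum()

# Negative log-prior = (1/2) * R(W) up to an additive constant,
# where R(W) is the sum of squared entries of W.
map_loss = cross_entropy + 0.5 * np.sum(W**2)
print(cross_entropy, map_loss)
```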

If you have a more or less arbitrary penalty function (reasonable in the sense that $e^{-R(W)}$ has a finite integral over the domain), you can turn it into a prior density by normalizing $e^{-R(W)}$ so that it integrates to one. In this case it's easy to see that the result is a Gaussian. For example, an L1 penalty $\sum_{ij} |w_{ij}|$ corresponds in the same way to a Laplace prior.
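As a small sketch of that normalization (the choice of the L1 penalty and the SciPy quadrature are my own illustration): normalizing $e^{-|w|}$ recovers the standard Laplace density.

```python
import numpy as np
from scipy.integrate import quad

# Normalize exp(-r(w)) so it integrates to one; with r(w) = |w|
# the result should be the standard Laplace density 0.5 * exp(-|w|).
Z, _ = quad(lambda w: np.exp(-abs(w)), -np.inf, np.inf)
print(Z)  # ~2.0, the Laplace normalizer

def density(w):
    return np.exp(-abs(w)) / Z

print(density(1.3), 0.5 * np.exp(-1.3))  # the two should match
```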

This does not mean that the elements of $W$ actually are sampled from a Gaussian. It means that you believe that is what $W$ looks like before you have any evidence to the contrary. In other words, the prior on the elements of $W$ is a Gaussian.