Solved – MAP estimation as regularisation of MLE

extreme value, maximum likelihood, posterior, prior, regularization

Going through the Wikipedia article on Maximum a posteriori estimation, I got confused after reading this:

It is closely related to the method of maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

How can the MAP estimation be seen as a regularization of ML estimation?

EDIT:

My understanding of regularization, in the context of machine learning, is that it penalizes large weights. This is done by modifying the optimization problem: a term containing the weights to be learned is added to the loss function. Since the objective is to minimize the loss, parameters with larger values are penalized more.
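For example, something like this is the picture I have in mind (a toy sketch of my own; `lam` is just a made-up regularization strength):

```python
import numpy as np

# Toy sketch of a regularized loss: a squared-error data term plus an
# L2 penalty on the weights w, with lam a made-up regularization strength.
def regularized_loss(w, X, y, lam):
    residuals = X @ w - y
    data_term = 0.5 * np.sum(residuals ** 2)  # how well the weights fit the data
    penalty = lam * np.sum(w ** 2)            # penalizes large weights
    return data_term + penalty
```

Minimizing this keeps the weights small unless the data term gives a good reason not to.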

An intuitive explanation is very welcome.

Best Answer

The maximum likelihood (ML) method aims at finding the model parameters that best match the data:

$$ \theta_{ML}=\mathrm{argmax}_\theta \,p(x|y,\theta) $$

Maximum likelihood does not use any prior knowledge about the expected distribution of the parameters $\theta$ and thus may overfit to the particular data $x$, $y$.

The maximum a posteriori (MAP) method adds a prior distribution over the parameters $\theta$:

$$ \theta_{MAP}=\mathrm{argmax}_\theta \, p(x|y,\theta)p(\theta) $$ The optimal solution must still match the data, but it also has to conform to your prior knowledge about the parameter distribution.
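As a small illustration of that trade-off, here is a sketch (my own example, not from the original sources), assuming Gaussian data with a known noise variance and a zero-mean Gaussian prior on the mean; all names and values are made up:

```python
import numpy as np

# Sketch: ML vs. MAP estimation of a Gaussian mean, assuming the noise
# variance sigma2 is known and the prior on the mean is N(0, tau2).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10)  # observed data
sigma2 = 1.0  # assumed (known) noise variance
tau2 = 0.5    # assumed prior variance of the mean

# ML estimate: the sample mean, i.e. the best match to the data alone.
theta_ml = x.mean()

# MAP estimate: the posterior mode, which shrinks the sample mean toward
# the prior mean (0) by an amount controlled by tau2.
n = len(x)
theta_map = (n / sigma2) * x.mean() / (n / sigma2 + 1 / tau2)

print(theta_ml, theta_map)  # theta_map lies between 0 and theta_ml
```

The narrower the prior (the smaller `tau2`), the more the MAP estimate is pulled away from the pure data fit and toward zero.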

How is this related to adding a regularizer term to a loss function?

Instead of maximizing the posterior directly, one often minimizes its negative logarithm:

$$ \begin{align} \theta_{MAP}&=\mathrm{argmin}_\theta \, -\log p(x|y,\theta)p(\theta) \\ &=\mathrm{argmin}_\theta \, -(\log p(x|y,\theta) + \log p(\theta)) \end{align} $$

Assuming you want the parameters $\theta$ to be normally distributed around zero, you get $-\log p(\theta) \propto \|\theta\|_2^2$ (up to an additive constant), so the prior contributes an $L_2$ penalty on the weights to the minimization above.
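Spelling this out, and writing $\sigma_p^2$ for the prior variance (a symbol introduced here just for this step): with a $\mathcal{N}(0,\sigma_p^2 I)$ prior,

$$ -\log p(\theta) = \frac{1}{2\sigma_p^2}\|\theta\|_2^2 + \text{const}, \qquad \theta_{MAP}=\mathrm{argmin}_\theta \, \Big(-\log p(x|y,\theta) + \lambda\|\theta\|_2^2\Big), \quad \lambda=\frac{1}{2\sigma_p^2}, $$

which is exactly the familiar $L_2$-regularized (weight-decay) loss, with the regularization strength $\lambda$ determined by how narrow the prior is.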