Bayesian – Example of Maximum A Posteriori Estimation

bayesian, estimation, posterior

I have been reading about maximum likelihood estimation and maximum a posteriori estimation, but so far I have only come across concrete examples for maximum likelihood estimation. I have found some abstract examples of maximum a posteriori estimation, but nothing concrete with actual numbers yet :S

Working only with abstract variables and functions can be very overwhelming, and in order not to drown in this abstractness, it's nice to relate things to the real world from time to time. But of course, this is just my (and some other people's) observation 🙂

Therefore, could anyone give me a simple but concrete example of maximum a posteriori estimation with numbers in it? That would help a lot 🙂

Thank you!

I originally posted this question on MSE, but couldn't get an answer there:

https://math.stackexchange.com/questions/449386/example-of-maximum-a-posteriori-estimation

I have followed the instructions given here on cross posting:

http://meta.math.stackexchange.com/questions/5028/how-do-i-move-a-post-to-another-forum-like-cv-stats

Best Answer

1st Example

A typical case is part-of-speech tagging in natural language processing. See here for a detailed explanation. The goal is to determine the lexical category of each word in a sentence (is it a noun, an adjective, ...?). The basic idea is that you have a model of your language consisting of a hidden Markov model (HMM), in which the hidden states correspond to the lexical categories and the observed states to the actual words.

The corresponding graphical model has the following form:

[Figure: graphical model of a canonical HMM]

where $\mathbf{y} = (y_{1},\ldots,y_{N})$ is the sequence of words in the sentence, and $\mathbf{x} = (x_{1},\ldots,x_{N})$ is the sequence of tags.

Once the model is trained, the goal is to find the sequence of lexical categories that corresponds to a given input sentence. This is formulated as finding the sequence of tags that is most compatible with, i.e. most likely to have been generated by, the language model:

$$f(\mathbf{y}) = \underset{\mathbf{x}}{\arg\max}\; p(\mathbf{x})\,p(\mathbf{y} \mid \mathbf{x})$$
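As a concrete (if toy) illustration, here is a minimal sketch of this MAP decoding done with the Viterbi algorithm. The three-tag model, the tiny vocabulary, and all the probabilities below are made up for the example, not taken from any trained tagger.

```python
# Toy MAP decoding in an HMM tagger: argmax_x p(x) p(y | x) via Viterbi.
import numpy as np

tags = ["NOUN", "VERB", "DET"]          # hidden states x
words = ["the", "dog", "barks"]         # observed sentence y

pi = np.array([0.2, 0.1, 0.7])          # p(x_1): initial tag probabilities
A = np.array([[0.2, 0.7, 0.1],          # p(x_n | x_{n-1}): rows = previous tag
              [0.4, 0.2, 0.4],
              [0.9, 0.05, 0.05]])
B = np.array([[0.05, 0.6, 0.35],        # p(y_n | x_n): rows = tag, cols = word
              [0.05, 0.1, 0.85],
              [0.9, 0.05, 0.05]])

def viterbi(obs, pi, A, B):
    """Return the MAP tag sequence for the observed word indices."""
    N, K = len(obs), len(pi)
    logd = np.full((N, K), -np.inf)     # log of the best partial-path probability
    back = np.zeros((N, K), dtype=int)  # backpointers
    logd[0] = np.log(pi) + np.log(B[:, obs[0]])
    for n in range(1, N):
        for k in range(K):
            scores = logd[n - 1] + np.log(A[:, k]) + np.log(B[k, obs[n]])
            back[n, k] = np.argmax(scores)
            logd[n, k] = scores[back[n, k]]
    path = [int(np.argmax(logd[-1]))]   # trace back the most probable sequence
    for n in range(N - 1, 0, -1):
        path.append(back[n, path[-1]])
    return path[::-1]

obs = list(range(len(words)))           # word indices 0, 1, 2
print([tags[k] for k in viterbi(obs, pi, A, B)])   # -> ['DET', 'NOUN', 'VERB']
```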

2nd Example

Actually, a better example might be regression, not only because it is easier to understand, but also because it makes the difference between maximum likelihood (ML) and maximum a posteriori (MAP) estimation clear.

Basically, the problem is that of fitting the function underlying the samples $t$ with a linear combination of a set of basis functions, $$y(\mathbf{x};\mathbf{w}) = \sum_{i}w_{i}\phi_{i}(\mathbf{x})$$ where the $\phi_{i}(\mathbf{x})$ are the basis functions and $\mathbf{w}$ are the weights. The samples are usually assumed to be corrupted by Gaussian noise, so if the target function can be written exactly as such a linear combination, we have

$$t = y(\mathbf{x};\mathbf{w}) + \epsilon$$

where $\epsilon$ is zero-mean Gaussian noise with variance $\sigma^{2}$, so $p(t\mid\mathbf{w}) = \mathcal{N}(t\mid y(\mathbf{x};\mathbf{w}),\sigma^{2})$. Taking the negative logarithm of this likelihood over all samples and dropping the terms that do not depend on $\mathbf{w}$ shows that the ML solution of this problem is equivalent to minimizing

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n}\left(t_{n} - \mathbf{w}^{T}\phi(\mathbf{x}_{n}) \right)^{2}$$

which yields the well-known least-squares solution. Now, ML is sensitive to noise and, under certain circumstances, unstable. MAP lets you pick better solutions by putting constraints on the weights. For example, a typical case is ridge regression, where you require the weights to have as small a norm as possible,

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n}\left(t_{n} - \mathbf{w}^{T}\phi(\mathbf{x}_{n}) \right)^{2} + \lambda \sum_{k}w_{k}^{2}$$

which is equivalent (up to constant factors involving the noise variance) to placing a zero-mean Gaussian prior on the weights, $\mathcal{N}(\mathbf{w}\mid\mathbf{0},\lambda^{-1}\mathbf{I})$. Overall, the estimated weights are

$$\hat{\mathbf{w}} = \underset{\mathbf{w}}{\arg\max}\; p(\mathbf{w};\lambda)\,p(t\mid\mathbf{w};\phi)$$
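To see the correspondence between the quadratic penalty and the Gaussian prior explicitly, take the negative logarithm of this objective; assuming unit noise variance for simplicity (so that no extra scale factors appear),

$$-\log\bigl[p(\mathbf{w};\lambda)\,p(t\mid\mathbf{w};\phi)\bigr] = \frac{1}{2}\sum_{n}\bigl(t_{n}-\mathbf{w}^{T}\phi(\mathbf{x}_{n})\bigr)^{2} + \frac{\alpha}{2}\,\mathbf{w}^{T}\mathbf{w} + \text{const},$$

where $\alpha$ is the precision of the Gaussian prior. Identifying $\alpha/2$ with $\lambda$ recovers the ridge objective $E(\mathbf{w})$ above, so maximizing the posterior is the same as minimizing the penalized squared error.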

Notice that in MAP the weights are not fixed parameters as in ML, but random variables with a prior. Nevertheless, both ML and MAP are point estimators: they return a single optimal set of weights rather than a full distribution over the weights.
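Finally, since the question asks for actual numbers, here is a small self-contained sketch comparing the ML and MAP (ridge) weight estimates; the sine-plus-noise data, the polynomial basis, and the value of $\lambda$ are arbitrary choices for illustration.

```python
# ML vs. MAP (ridge) estimates for the linear-in-basis regression model above.
# Toy setup: 10 noisy samples of sin(2*pi*x), degree-9 polynomial basis.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)   # noisy targets
Phi = np.vander(x, 10, increasing=True)                        # phi_i(x) = x**i

# ML: minimize 1/2 * sum_n (t_n - w^T phi(x_n))^2  (ordinary least squares)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# MAP with the Gaussian prior on w: minimize
#   1/2 * sum_n (t_n - w^T phi(x_n))^2 + lambda * sum_k w_k^2,
# which has the closed-form solution (Phi^T Phi + 2*lambda*I)^{-1} Phi^T t.
lam = 1e-3
w_map = np.linalg.solve(Phi.T @ Phi + 2 * lam * np.eye(Phi.shape[1]), Phi.T @ t)

print("||w_ml||  =", np.linalg.norm(w_ml))    # typically huge: the fit chases the noise
print("||w_map|| =", np.linalg.norm(w_map))   # much smaller: the prior shrinks the weights
```

In this setup the unconstrained ML weights blow up, because a degree-9 fit through 10 noisy points interpolates the noise, while the MAP weights stay small, which is exactly the stabilizing effect of the Gaussian prior described above.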