1. On p. 16 they mention that the package uses "BFGS" or "Nelder-Mead", the second one being the default option. Please take a look at these links to see their differences. The optimisation is actually performed on the likelihood function, that is, the fitted parameters in the output of vgFit
are actually the maximum likelihood estimators.
2. It is difficult to tell which optimisation method is better in general. What you can do instead is compare different methods and check whether the results coincide, using the command optim
and the command vgFit
. Below is some code for maximising the likelihood function; you can choose between six different optimisation methods.
library("VarianceGamma")
# Simulate 100 observations from a variance-gamma distribution with parameters (0,1,0,1)
data = rvg(100, vgC = 0, sigma = 1, theta = 0, nu = 1)
# -log-likelihood function using the Variance-Gamma package
ll = function(par){
  if (par[2] > 0 & par[4] > 0) return(-sum(log(dvg(data, vgC = par[1], sigma = par[2],
                                                   theta = par[3], nu = par[4]))))
  else return(Inf)
}
# Direct minimisation of the negative log-likelihood using the command optim.
# Pick one of "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN"
# ("Brent" only applies to one-dimensional problems).
optim(c(0, 1, 0, 1), ll, method = "Nelder-Mead")
# Maximisation using the command vgFit
vgFit(data)
The advantage of vgFit
is that you do not need to specify the starting value for the search. It implements three different methods for choosing one: "US", "SL" and "MoM", with "SL" being the default. I would not trust these methods blindly; I would rather compare the results from vgFit
and optim
.
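For instance, here is a rough sketch of such a comparison (assuming, as described in the package manual, that vgFit takes a startValues argument with the codes "US", "SL" and "MoM", a paramStart argument for user-supplied starting values, and returns the fitted parameters in its param component):
# Default starting values ("SL") versus method-of-moments starting values
fit_sl  = vgFit(data)
fit_mom = vgFit(data, startValues = "MoM")
# User-supplied starting values (here simply the true simulation parameters;
# order assumed to be vgC, sigma, theta, nu)
fit_us  = vgFit(data, startValues = "US", paramStart = c(0, 1, 0, 1))
# Direct minimisation of the negative log-likelihood with optim
opt = optim(c(0, 1, 0, 1), ll, method = "BFGS")
# Compare the four sets of estimates
rbind(SL = fit_sl$param, MoM = fit_mom$param, US = fit_us$param, optim = opt$par)
If the estimates agree to a reasonable number of digits, you can be fairly confident in the fit.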
3. You can check the references in the manual. For example
Seneta, E. (2004). Fitting the variance-gamma model to financial data. Journal of Applied Probability, 41A, 177–187.
Kotz, S., Kozubowski, T. J., and Podgórski, K. (2001). The Laplace Distribution and Generalizations. Birkhäuser, Boston, 349 p.
Madan, D. B. and Seneta, E. (1990). The variance gamma (V.G.) model for share market returns. Journal of Business, 63, 511–524.
The unnormalized posterior (prior times likelihood, i.e., the numerator) in your example is
\begin{equation}
f(X=2\mid \mu)\cdot P(\mu) = \left\{
\begin{array}{ll}
\frac{\mu_j^2}{2}e^{-\mu_j} \cdot P(\mu_j), & \mu=\mu_j,\ j\in\{1,2,3\} \\
0, & \mu \notin \{\mu_1,\mu_2,\mu_3\}
\end{array}
\right.
\end{equation}
the logarithm of this is
\begin{equation}
\left\{
\begin{array}{ll}
-\log(2) + 2\log(\mu_j) - \mu_j + \log(P(\mu_j)), & \mu=\mu_j,\ j\in\{1,2,3\} \\
-\infty, & \mu \notin \{\mu_1,\mu_2,\mu_3\}
\end{array}
\right.,
\end{equation}
i.e., the logarithm is $-\infty$ (undefined, if you prefer) except at these 3 points. Therefore, you cannot take the derivative with respect to $\mu$ (your error lies in thinking you can take the derivative and omit the $P(\mu)$ factor, which would correspond to using a uniform prior for $\mu$). Instead, you maximize the posterior by evaluating it, or its logarithm, at the 3 points where it is nonzero and picking the maximum. In this case, the computation using logarithms works out as follows:
\begin{equation}
\begin{array}{ccccc}
\textrm{Programmer} & \mu & \textrm{Log-likelihood} & \textrm{Log-prior} & \textrm{Sum} \\
\textrm{Mary} & 3 & -1.50 & -1.39 & \mathbf{-2.88} \\
\textrm{Tom} & 5 & -2.47 & -1.39 & -3.86 \\
\textrm{Jane} & 0.5 & -2.58 & -0.69 & -3.27 \\
\textrm{-} & \textrm{other} & -\log(2)+2\log(\mu)-\mu & -\infty & -\infty
\end{array}
\end{equation}
i.e., $\mu=3$ (Mary) is indeed the MAP estimate.
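The same computation can be written in a couple of lines of R (the priors $1/4$, $1/4$ and $1/2$ are read off the log-prior column above):
mu    = c(Mary = 3, Tom = 5, Jane = 0.5)
prior = c(Mary = 1/4, Tom = 1/4, Jane = 1/2)
# log-posterior (up to a constant) = log-likelihood of X = 2 plus log-prior
log_post = dpois(2, mu, log = TRUE) + log(prior)
round(log_post, 2)            # -2.88 -3.86 -3.27
names(which.max(log_post))    # "Mary", i.e. the MAP estimate is mu = 3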
Best Answer
1st Example
A typical case is tagging in the context of natural language processing. See here for a detailed explanation. The goal is to determine the lexical category of each word in a sentence (is it a noun, an adjective, ...). The basic idea is that you have a model of your language consisting of a hidden Markov model (HMM): the hidden states correspond to the lexical categories, and the observed states to the actual words.
The corresponding graphical model is the usual HMM chain, where $\mathbf{y} = (y_1,\dots,y_{N})$ is the sequence of words in the sentence and $\mathbf{x} = (x_1,\dots,x_{N})$ is the sequence of tags.
Once trained, the goal is to find the sequence of lexical categories that corresponds to a given input sentence. This is formulated as finding the sequence of tags that is most compatible with, i.e. most likely to have been generated by, the language model:
$$f(\mathbf{y}) = \mathbf{argmax}_{\mathbf{x} \in \mathcal{X}}\,p(\mathbf{x})\,p(\mathbf{y}\mid\mathbf{x})$$
where $\mathcal{X}$ is the set of possible tag sequences.
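As a toy illustration of that argmax (not part of the original answer: the two tags, the three-word sentence and all probabilities below are made up, and the search is done by brute force over the $2^3$ tag sequences rather than with the Viterbi algorithm):
# Hypothetical HMM with 2 tags and a 3-word sentence
tags  = c("Noun", "Verb")
words = c("dogs", "bark", "loudly")
# Made-up initial, transition and emission probabilities
init  = c(Noun = 0.6, Verb = 0.4)
trans = matrix(c(0.3, 0.7,
                 0.8, 0.2), nrow = 2, byrow = TRUE, dimnames = list(tags, tags))
emit  = matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.6, 0.3), nrow = 2, byrow = TRUE, dimnames = list(tags, words))
# p(x) * p(y | x) for a given tag sequence x
joint = function(x) {
  p = init[x[1]] * emit[x[1], words[1]]
  for (n in 2:length(words)) p = p * trans[x[n - 1], x[n]] * emit[x[n], words[n]]
  p
}
# Brute-force argmax over all tag sequences
all_x = expand.grid(rep(list(tags), length(words)), stringsAsFactors = FALSE)
all_x[which.max(apply(all_x, 1, joint)), ]
In a real tagging application the space of tag sequences is far too large for brute force, which is where dynamic programming (the Viterbi algorithm) comes in.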
2nd Example
Actually, a better example would be regression, not only because it is easier to understand, but also because it makes the difference between maximum likelihood (ML) and maximum a posteriori (MAP) estimation clear.
Basically, the problem is that of fitting a function to the observed targets $t$ with a linear combination of a set of basis functions, $$y(\mathbf{x};\mathbf{w}) = \sum_{i}w_{i}\phi_{i}(\mathbf{x})$$ where the $\phi_{i}(\mathbf{x})$ are the basis functions and $\mathbf{w}$ are the weights. It is usually assumed that the samples are corrupted by Gaussian noise. Hence, if we assume that the target function can be written exactly as such a linear combination, then we have
$$t = y(\mathbf{x};\mathbf{w}) + \epsilon$$
so that $p(t\mid\mathbf{w}) = \mathcal{N}(t\mid y(\mathbf{x};\mathbf{w}),\sigma^{2})$, with $\sigma^{2}$ the noise variance. The ML solution of this problem is equivalent to minimizing
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n}\left(t_{n} - \mathbf{w}^{T}\phi(\mathbf{x}_{n}) \right)^{2}$$
which yields the well-known least-squares solution. Now, ML is sensitive to noise and, under certain circumstances, not stable. MAP allows you to pick better solutions by putting constraints on the weights. For example, a typical case is ridge regression, where you demand that the weights have a norm as small as possible,
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n}\left(t_{n} - \mathbf{w}^{T}\phi(\mathbf{x}_{n}) \right)^{2} + \lambda \sum_{k}w_{k}^{2}$$
which is equivalent to setting a Gaussian prior $\mathcal{N}(\mathbf{w}\mid\mathbf{0},\lambda^{-1}\mathbf{I})$ on the weights. All in all, the estimated weights are
$$\mathbf{w} = \mathbf{argmax}_{\mathbf{w}}\,p(\mathbf{w};\lambda)\,p(t\mid\mathbf{w};\phi)$$
Notice that in MAP the weights are not parameters as in ML, but random variables. Nevertheless, both ML and MAP are point estimators: they return an optimal set of weights rather than a full distribution over the weights.
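A minimal R sketch of this contrast, under assumptions of my own (a polynomial basis, a noisy sine target and an arbitrary $\lambda$, none of which appear in the answer above); the MAP/ridge weights use the closed-form solution $(\Phi^{T}\Phi + \lambda \mathbf{I})^{-1}\Phi^{T}\mathbf{t}$:
set.seed(1)
# Made-up data: noisy samples of sin(2*pi*x)
x = seq(0, 1, length.out = 15)
t = sin(2 * pi * x) + rnorm(length(x), sd = 0.3)
# Polynomial basis functions phi_i(x) = x^i, i = 0, ..., 5
Phi = outer(x, 0:5, `^`)
# ML / least-squares weights: minimise sum_n (t_n - w' phi(x_n))^2
w_ml = solve(crossprod(Phi), crossprod(Phi, t))
# MAP / ridge weights with Gaussian prior N(0, lambda^{-1} I); arbitrary lambda
lambda = 1e-3
w_map = solve(crossprod(Phi) + lambda * diag(ncol(Phi)), crossprod(Phi, t))
# Compare: the ridge penalty shrinks the weights towards zero
cbind(ML = w_ml, MAP = w_map)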