Solved – Are discrete single value prior distributions always lost in MAP estimation

bayesian, distributions, maximum-likelihood, posterior, prior

I’d like to illustrate my problem with a little (heavily abbreviated) exercise. I think it will help a lot to stress my point.

Meet Mary, Tom and Jane. They are all programmers. Mary is a decent programmer: in writing five programs she usually makes about 3 mistakes. Tom, on the other hand, is quite bad: he makes about one mistake in every program. Jane is by far the best: in ten programs she writes you will find only one mistake.

One day their boss checks five programs written by the same person and finds two mistakes. He wonders who might have written those programs. He knows that Jane is the only one who has a full-time job, and hence it is twice as likely that she wrote these five programs. He thinks that he might use a Poisson distribution and some basic Bayesian point parameter estimation to find an answer.

The likelihood function for all three is
$$f(X = 2 \mid \mu_i) = \frac{\mu_i^2}{2}\cdot e^{-\mu_i}$$
with $\mu_1 = 3$, $\mu_2 = 5$, $\mu_3 = 0.5$ for Mary, Tom and Jane.
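To make the numbers concrete, here is a minimal Python sketch (my own illustration; the function and variable names are not part of the exercise) that evaluates this likelihood for the three rates:

```python
import math

def likelihood(mu, x=2):
    """Poisson probability of x mistakes given rate mu: mu^x / x! * exp(-mu)."""
    return mu ** x / math.factorial(x) * math.exp(-mu)

rates = {"Mary": 3.0, "Tom": 5.0, "Jane": 0.5}
for name, mu in rates.items():
    print(f"{name}: f(X=2 | mu={mu}) = {likelihood(mu):.4f}")
```

This gives roughly 0.224 for Mary, 0.084 for Tom and 0.076 for Jane, so on the likelihood alone Mary is the most plausible author.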

Our prior distribution is based on information regarding part-time and full-time jobs:
$$ P(\mu) = \left\{
\begin{array}{l l}
1/2 & \quad \text{if $\mu = 0.5$}\\
1/4 & \quad \text{if $\mu = 3$ or $\mu = 5$}\\
0 & \quad \text{else}
\end{array} \right.$$

Now, let’s write the posterior distribution down:

$$f(\mu_j \mid X = 2) = \frac{f(X = 2 \mid \mu_j) \cdot P(\mu_j)}{\sum_{i=1}^3 f(X = 2 \mid \mu_i) \cdot P(\mu_i)}$$
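As a small sketch of this computation (the dictionary encoding of the prior and the variable names are my own, not part of the exercise), normalizing the weighted likelihoods and reading off the mode could look like this:

```python
import math

def likelihood(mu, x=2):
    # Poisson pmf at x for rate mu
    return mu ** x / math.factorial(x) * math.exp(-mu)

# prior mass on the three admissible rates (Mary, Tom, Jane)
prior = {3.0: 0.25, 5.0: 0.25, 0.5: 0.5}

# numerator of Bayes' rule for each admissible mu
unnormalized = {mu: likelihood(mu) * p for mu, p in prior.items()}
evidence = sum(unnormalized.values())        # denominator: sum over i

posterior = {mu: u / evidence for mu, u in unnormalized.items()}
map_mu = max(posterior, key=posterior.get)   # mode of the discrete posterior
print(posterior, "MAP:", map_mu)
```

The mode comes out at $\mu = 3$, i.e. Mary.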

Further, let us use MAP estimation (i.e., for readers not familiar with the terminology, maximum a posteriori estimation, in which we take the mode of the posterior distribution). When we look for the mode of the posterior distribution in our little example, we find that Mary is our (un)lucky star. In determining this we weighted the likelihood $L(\mu) = f(X=2\mid\mu_j)$ by $P(\mu_j)$; hence, this information (however slight) influences our estimate of the maximum.

But now we want to use MAP to find our parameter estimate for $\mu$. In general, when using MAP it does not matter whether or not we include the denominator of the posterior distribution in the maximization. Hence, we need only maximize the numerator $$\underset{\mu}{\arg \max}\;f(X = 2 \mid \mu_j) \cdot P(\mu_j)$$
We would then form the weighted log-likelihood $$\log\left(f(X = 2 \mid \mu_j) \cdot P(\mu_j)\right)$$ and take its derivative. And here is where it gets interesting for me. Whenever the prior is a discrete distribution given only at a few values (in our example the $\mu_i$), the term $\log P(\mu_i)$ is a constant and disappears when we take the derivative with respect to $\mu$. In our example the derivative is $$\frac{\partial\log(L(\mu))}{\partial \mu}=\frac{2}{\mu} - 1,$$ and setting it to zero gives $\hat\mu = 2$, which is exactly the MLE (maximum likelihood estimate), even though we used an informative prior. In short: although we weighted the likelihood by the probability of each person having written the programs (based on whether they hold a full-time or part-time job) when estimating the mode of the posterior distribution, this information is lost when estimating the parameter with MAP this way.
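For anyone who wants to check the calculus symbolically, here is a short sketch using SymPy (my choice of tool, not something implied by the exercise); any $\log P(\mu_j)$ term would be an additive constant and is therefore omitted:

```python
import sympy as sp

mu = sp.symbols("mu", positive=True)
loglik = sp.log(mu ** 2 / 2 * sp.exp(-mu))  # log f(X=2 | mu)

dloglik = sp.simplify(sp.diff(loglik, mu))
print(dloglik)                              # equivalent to 2/mu - 1
print(sp.solve(sp.Eq(dloglik, 0), mu))      # [2] -> the unconstrained MLE
```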

Questions:

(1) Do we always lose the prior information in MAP parameter estimation when the prior is given only at a few discrete values, or am I missing something here?

(2) Is this situation remedied if I model the three values with a distribution/formula that depends on $\mu$ but still allows only the three values $\mu_i$?

Thanks for any help!

Best Answer

The unnormalized posterior (prior times likelihood, i.e., the numerator) in your example is
\begin{equation}
f(X=2\mid \mu)\cdot P(\mu) = \left\{
\begin{array}{ll}
\frac{\mu_j^2}{2}e^{-\mu_j} \cdot P(\mu_j), & \quad \mu=\mu_j,\; j\in\{1,2,3\}, \\
0, & \quad \mu \notin \{\mu_1,\mu_2,\mu_3\}.
\end{array} \right.
\end{equation}
The logarithm of this is
\begin{equation}
\left\{
\begin{array}{ll}
-\log(2) + 2\log(\mu_j) - \mu_j + \log(P(\mu_j)), & \quad \mu=\mu_j,\; j\in\{1,2,3\}, \\
-\infty, & \quad \mu \notin \{\mu_1,\mu_2,\mu_3\},
\end{array} \right.
\end{equation}
i.e., the logarithm is $-\infty$ everywhere except at these 3 points. Therefore, you cannot take the derivative with respect to $\mu$ (your error lies in thinking you can take the derivative and omit the $P(\mu)$ factor; that would correspond to using a uniform prior for $\mu$). Instead, you maximize the posterior by evaluating the posterior, or its logarithm, at the 3 points where it is nonzero and picking the maximum. In this case, the computation using logarithms works out as follows:
\begin{equation}
\begin{array}{ccccc}
\textrm{Programmer} & \mu & \textrm{Log-likelihood} & \textrm{Log-prior} & \textrm{Sum} \\
\textrm{Mary} & 3 & -1.50 & -1.39 & \mathbf{-2.88} \\
\textrm{Tom} & 5 & -2.47 & -1.39 & -3.86 \\
\textrm{Jane} & 0.5 & -2.58 & -0.69 & -3.27 \\
\textrm{-} & \textrm{other} & -\log(2)+2\log(\mu)-\mu & -\infty & -\infty
\end{array}
\end{equation}
i.e., $\mu=3$ (Mary) is indeed the MAP estimate.
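A quick numerical check of the table above (a sketch I added; it simply recomputes the three rows):

```python
import math

cases = {"Mary": (3.0, 0.25), "Tom": (5.0, 0.25), "Jane": (0.5, 0.5)}

for name, (mu, prior) in cases.items():
    loglik = 2 * math.log(mu) - math.log(2) - mu   # log f(X=2 | mu)
    logprior = math.log(prior)
    print(f"{name}: log-lik={loglik:.2f}  log-prior={logprior:.2f}  sum={loglik + logprior:.2f}")
```

The largest sum is Mary's $-2.88$, confirming $\mu = 3$ as the MAP estimate.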