Solved – What Is Meant by “Maximising” Posterior Probability

bayesian, information theory, likelihood, maximum likelihood, posterior

My textbook says the following:


The optimal coding decision (optimal in the sense of having the smallest probability of being wrong) is to find which value of $\mathbf{s}$ is most probable, given $\mathbf{r}$. Consider the decoding of a single bit $s$, which was encoded as $\mathbf{t}(s)$ and gave rise to three received bits $\mathbf{r} = r_1 r_2 r_3$. By Bayes' theorem, the posterior probability of $s$ is

$$P(s | r_1 r_2 r_3) = \dfrac{P(r_1 r_2 r_3 | s) P(s)}{P(r_1 r_2 r_3)}. \tag{1.18}$$

We can spell out the posterior probability of the two alternatives thus:

$$P(s = 1 | r_1 r_2 r_3) = \dfrac{P(r_1 r_2 r_3 | s = 1) P(s = 1)}{P(r_1 r_2 r_3)}; \tag{1.19}$$

$$P(s = 0 | r_1 r_2 r_3) = \dfrac{P(r_1 r_2 r_3 | s = 0) P(s = 0)}{P(r_1 r_2 r_3)}. \tag{1.20}$$

This posterior probability is determined by two factors: the prior probability $P(s)$, and the data-dependent term $P(r_1 r_2 r_3 | s)$, which is called the likelihood of $s$. The normalising constant $P(r_1 r_2 r_3)$ needn't be computed when finding the optimal decoding decision, which is to guess $\hat{s} = 0$ if $P(s = 0 | \mathbf{r}) > P(s = 1| \mathbf{r})$, and $\hat{s} = 1$ otherwise.

To find $P(s = 0| \mathbf{r})$ and $P(s = 1 | \mathbf{r})$, we must make an assumption about the prior probabilities of the two hypotheses $s = 0$ and $s = 1$, and we must make an assumption about the probability of $\mathbf{r}$ given $s$. We assume that the prior probabilities are equal: $P(s = 0) = P(s = 1) = 0.5$; then maximising the posterior probability $P(s | \mathbf{r})$ is equivalent to maximising the likelihood $P(\mathbf{r} | s)$. And we assume the channel is a binary symmetric channel with noise level $f < 0.5$, so that the likelihood is

$$P(\mathbf{r} \mid s) = P(\mathbf{r} \mid t(s)) = \prod_{n = 1}^N P(r_n \mid t_n(s)), \tag{1.21}$$

where $N = 3$ is the number of transmitted bits in the block we are considering, and

$$P(r_n \mid t_n) = \begin{cases}
(1 - f) \qquad & r_n = t_n \\
f \qquad & r_n \neq t_n
\end{cases} \tag{1.22}$$

Page 6, Information Theory, Inference, and Learning Algorithms by David J.C. MacKay
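For concreteness, if I follow (1.21) and (1.22) correctly, a received word such as $\mathbf{r} = 110$ would give the two likelihoods

$$P(\mathbf{r} \mid s = 1) = (1 - f)(1 - f)\,f, \qquad P(\mathbf{r} \mid s = 0) = f \cdot f \cdot (1 - f),$$

since $\mathbf{t}(1) = 111$ and $\mathbf{t}(0) = 000$.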

However, I don't understand what is meant by "maximising the posterior probability", and therefore, by extension, the following passage:

maximising the posterior probability $P(s | \mathbf{r})$ is equivalent to maximising the likelihood $P(\mathbf{r} | s)$.

What does it mean to "maximise" the posterior probability? And why is it necessary in this context? This makes no sense to me, since when I hear of "maximising" something, I think of calculus. And how is maximising the posterior probability $P(s | \mathbf{r})$ equivalent to maximising the likelihood $P(\mathbf{r} | s)$? It seems to me like these would actually be two different things?

I would greatly appreciate it if people could please take the time to clarify this.

Best Answer

Think of $P(s \mid r)$ as a function of $s$. Now, without computing the normalising constant, you compare, say, $P(s = 1 \mid r) > P(s = 0 \mid r)$.

How is maximising the posterior probability P(s|r) equivalent to maximising the likelihood P(r|s)?

Expand your posterior terms: $P(s = i | r) = \frac{P(r | s = i)P(s = i)}{P(r)}$

Then:

$$\begin{aligned}
&P(s = 1 \mid r) > P(s = 0 \mid r) \\
\Rightarrow\ &\frac{P(r \mid s = 1)\,P(s = 1)}{P(r)} > \frac{P(r \mid s = 0)\,P(s = 0)}{P(r)} \\
\Rightarrow\ &P(r \mid s = 1)\,P(s = 1) > P(r \mid s = 0)\,P(s = 0) \\
\Rightarrow\ &P(r \mid s = 1) > P(r \mid s = 0)
\end{aligned}$$

To get line 3, we multiplied both sides by $P(r)$, which is the same constant on both sides. To get line 4, we divided both sides by $0.5$, since $P(s = 0) = P(s = 1) = 0.5$. Hence your posterior comparison is equivalent to a comparison of likelihoods.
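Here is a minimal numerical sketch of this equivalence (the noise level $f$, the received word $\mathbf{r}$, and all names below are my own illustrative choices, not from the book): with a uniform prior, the $s$ that maximises the likelihood is also the $s$ that maximises the posterior.

```python
# Decoding one bit sent through a binary symmetric channel with the R3
# repetition code. Noise level f, received word r, and helper names are
# illustrative, not from MacKay's text.

f = 0.1                     # flip probability of the channel, f < 0.5
prior = {0: 0.5, 1: 0.5}    # uniform prior P(s)
r = (1, 1, 0)               # received bits r1 r2 r3

def likelihood(r, s, f):
    """P(r | s) for the repetition code t(s) = (s, s, s), as in (1.21)-(1.22)."""
    p = 1.0
    for rn in r:
        p *= (1 - f) if rn == s else f
    return p

# Unnormalised posteriors P(s | r) ∝ P(r | s) P(s); P(r) cancels in the comparison.
posterior = {s: likelihood(r, s, f) * prior[s] for s in (0, 1)}

s_ml  = max((0, 1), key=lambda s: likelihood(r, s, f))   # maximum-likelihood choice
s_map = max(posterior, key=posterior.get)                # maximum-posterior choice

print(s_ml, s_map)   # 1 1 -- with a uniform prior the two decisions coincide
```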

It seems to me like these would actually be two different things?

If your prior were not uniform ($P(s = 0) \neq P(s = 1)$), the prior terms would **not** cancel in the last step, and your posterior maximization would **not** be equivalent to your likelihood maximization.
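To see the two decisions actually disagree, here is the same sketch with a skewed prior (the numbers are my own illustration): take $f = 0.2$, $P(s = 0) = 0.9$, $P(s = 1) = 0.1$, and received word $\mathbf{r} = 110$.

```python
# Same binary-symmetric-channel setup as above, but with a non-uniform prior.
f = 0.2
prior = {0: 0.9, 1: 0.1}
r = (1, 1, 0)

def likelihood(r, s, f):
    # P(r | s) for the repetition code t(s) = (s, s, s)
    p = 1.0
    for rn in r:
        p *= (1 - f) if rn == s else f
    return p

lik  = {s: likelihood(r, s, f) for s in (0, 1)}   # {0: 0.032, 1: 0.128}
post = {s: lik[s] * prior[s] for s in (0, 1)}     # {0: 0.0288, 1: 0.0128}

print(max(lik,  key=lik.get))    # 1 -> the likelihood prefers s = 1
print(max(post, key=post.get))   # 0 -> the posterior prefers s = 0
```

Here the likelihood favours $s = 1$ (two of the three received bits are 1), but the heavy prior on $s = 0$ flips the posterior decision.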

What does it mean to "maximise" the posterior probability?

To maximize, i.e. find the largest value of, the posterior $P(s = i \mid r)$, you find the value of $i$ at which $P(s = i \mid r)$ is largest. In your (discrete) case you simply compute both $P(s = 1 \mid r)$ and $P(s = 0 \mid r)$ and see which one is larger; that one is the maximum.
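In code form this "maximisation" is nothing more than an argmax over the two candidates (the posterior values below are made up purely for illustration):

```python
# Hypothetical posterior values P(s = i | r) for the two candidates i = 0, 1.
posterior = {0: 0.1, 1: 0.9}

# "Maximising the posterior" = picking the i with the largest value; no calculus.
s_hat = max(posterior, key=posterior.get)
print(s_hat)   # 1
```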

why is it necessary in this context?

This is answered in the passage you quoted:

The optimal coding decision (optimal in the sense of having the smallest probability of being wrong) is to find which value of $s$ is most probable, given $\mathbf{r}$.

PS. If you like calculus: maximizing a discrete distribution just means finding the argument $i$ at which the (discrete) function $i \mapsto P(s = i \mid r)$ takes its largest value; since there are only finitely many candidates, you compare values rather than take derivatives.
