A “normalized likelihood” in syndrome coding

coding-theory, information-theory, parameter-estimation, probability

I am reading the book "Information Theory, Inference, and Learning Algorithms" by David J.C. MacKay (available for free from the author's website). On p. 328 (p. 340 of the pdf), maximum likelihood decoding of the (7,4) Hamming code is considered and solved by brute-force enumeration of the posterior probabilities of all codewords. However, I do not quite understand the problem statement, which is given in terms of "normalized likelihoods".

A codeword $\mathbf{x}$ (4 message bits and 3 parity bits for the (7,4) Hamming code) is sent over some noisy channel. The channel defines (and is defined by) all conditional probabilities $P(\mathbf{y}|\mathbf{x})$. (I am only really interested in symmetric memoryless channels.) When the noisy message $\mathbf{y}$ is received (which may have continuous values, as in the case of the BIAWGN channel), it is decoded by finding the $\hat{\mathbf{x}}$ that maximizes the posterior probability, $\hat{\mathbf{x}} = \operatorname{argmax}_{\mathbf{x}}P(\mathbf{x}|\mathbf{y})$, where by Bayes' theorem
$$
P(\mathbf{x}|\mathbf{y}) = \frac{P(\mathbf{y}|\mathbf{x})P(\mathbf{x})}{P(\mathbf{y})}.
$$
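
To make this decoding rule concrete, here is a minimal sketch (my own, not taken from the book) of MAP decoding by brute-force enumeration. To keep it to a few lines it uses a toy 3-bit repetition codebook and made-up per-bit likelihoods; with a uniform prior, $P(\mathbf{y})$ cancels and the posterior of each codeword is just its likelihood divided by the sum over the codebook.

```python
import numpy as np

# Toy codebook: the two codewords of a 3-bit repetition code (stand-in for the (7,4) code).
codebook = [np.array([0, 0, 0]), np.array([1, 1, 1])]

# Made-up per-bit likelihoods P(y_n | x_n = 1); here P(y_n | x_n = 0) is taken as the complement,
# purely for illustration (in general the channel fixes both values separately).
p1 = np.array([0.2, 0.6, 0.3])
p0 = 1.0 - p1

def likelihood(x):
    """P(y | x) for a memoryless channel: product of the per-bit likelihoods."""
    return np.prod(np.where(x == 1, p1, p0))

likelihoods = np.array([likelihood(x) for x in codebook])
posteriors = likelihoods / likelihoods.sum()      # uniform prior: the 1/|C| factors cancel
x_hat = codebook[int(np.argmax(posteriors))]      # MAP codeword (= ML codeword here)
print(posteriors, x_hat)
```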

In the setting of the book, neither the received vector nor the channel is given; instead, only "normalized likelihoods" are provided, and I don't quite understand what these are. The prior distribution $P(\mathbf{x})$ is not mentioned; I guess it is implicitly assumed to be uniform.

The seven "normalized likelihoods" given are $(0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3)$. As illustration some ratios of likelihoods are also given (quote):
$$
\frac{P(y_1|x_1=1)}{P(y_1|x_1=0)}=\frac{0.1}{0.9},\quad \frac{P(y_2|x_2=1)}{P(y_2|x_2=0)}=\frac{0.4}{0.6},\quad \text{etc.}
$$

As far as I know, there is no reason why $P(y_1|x_1=1) + P(y_1|x_1=0)$ should equal 1 for a given $y_1$. For example, in the case of the BIAWGN channel, $P(y_1|x_1=1)$ and $P(y_1|x_1=0)$ are normal densities with the same standard deviation, centered at (e.g.) $\pm 1$; they do not add to 1 for almost all values of $y_1$. Is this a consequence of the "normalization" of the likelihoods, whatever it means here?

The author also writes: "from the point of view of decoding, all that matters is the likelihood ratio". While this seems intuitively plausible to me, is there a simple way to see that this is true in general from Bayes' theorem?

In the book the likelihoods of all codewords are computed. E.g., for $[0000000]$ MacKay obtains the likelihood $P(\mathbf{y}|\mathbf{x}=[0000000]) = 0.0275562$, which is equal to the product of the single-bit likelihoods $\prod_n P(y_n|x_n=0)$. This works out numerically for the values given above, assuming $P(y_n|x_n=1) + P(y_n|x_n=0) = 1$ for all $n$ (why?), but it would depend on whatever "normalization" of the likelihoods is being used, right? How does one see that the posterior probability would not? (In the book, the posterior probability of a codeword is equal to its likelihood divided by the sum of all codeword likelihoods, consistent with the assumption that the prior is uniform.)
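
For what it's worth, here is a quick numerical check (my own sketch). It rebuilds the 16 codewords using the parity rule $t_5=s_1\oplus s_2\oplus s_3$, $t_6=s_2\oplus s_3\oplus s_4$, $t_7=s_1\oplus s_3\oplus s_4$ (which I believe is MacKay's convention, though any convention illustrates the point), reads the seven values above as $P(y_n|x_n=1)$ with $P(y_n|x_n=0)=1-P(y_n|x_n=1)$, and forms each codeword likelihood as the product of per-bit likelihoods; the $[0000000]$ entry indeed comes out to $0.0275562$, and the posteriors are obtained by dividing by the sum.

```python
from itertools import product

# Per-bit values from the book, read as P(y_n | x_n = 1); assume P(y_n | x_n = 0) = 1 - that value.
p1 = [0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3]
p0 = [1.0 - p for p in p1]

def encode(s):
    """(7,4) Hamming encoding; parity convention assumed, see text above."""
    s1, s2, s3, s4 = s
    return [s1, s2, s3, s4, s1 ^ s2 ^ s3, s2 ^ s3 ^ s4, s1 ^ s3 ^ s4]

codebook = [encode(s) for s in product([0, 1], repeat=4)]

def likelihood(x):
    """P(y | x) = product over bits of the per-bit likelihoods."""
    out = 1.0
    for n, bit in enumerate(x):
        out *= p1[n] if bit == 1 else p0[n]
    return out

likelihoods = {tuple(x): likelihood(x) for x in codebook}
total = sum(likelihoods.values())

print(likelihoods[(0, 0, 0, 0, 0, 0, 0)])            # 0.0275562
for x, L in sorted(likelihoods.items(), key=lambda kv: -kv[1]):
    print("".join(map(str, x)), L, L / total)        # likelihood and posterior (uniform prior)
```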

How does one obtain these likelihoods in practice? For the binary symmetric channel BSC($p$), $y_1$ can only take values in $\{0, 1\}$ and (correct me if I'm wrong) $\frac{P(y_1=1|x_1=1)}{P(y_1=1|x_1=0)}=\frac{1-p}{p}$, while $\frac{P(y_1=0|x_1=1)}{P(y_1=0|x_1=0)}=\frac{p}{1-p}$. In this case, the likelihoods for a given $y_1$ do add up to one. This need not hold in general, though, right?
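
As a small sanity check of the BSC case (my own sketch): each of the two possible received values gives a likelihood pair that happens to sum to one, with exactly the ratios above.

```python
def bsc_likelihoods(y, p):
    """Return (P(y | x=0), P(y | x=1)) for a BSC with crossover probability p."""
    return (1.0 - p, p) if y == 0 else (p, 1.0 - p)

p = 0.1
for y in (0, 1):
    l0, l1 = bsc_likelihoods(y, p)
    # The pair sums to 1; the ratio l1/l0 is p/(1-p) for y=0 and (1-p)/p for y=1.
    print(y, l0, l1, l0 + l1, l1 / l0)
```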

For BIAWGN($\sigma$), suppose (case 1) that in one transmission I received a fixed value $a_1$ for $y_1$. Alternatively, (more realistically) I only know that the received value lies in an interval $I_1$ (case 2), or the value itself has a probability distribution due to measurement error, $P(a_1)$ (case 3). How do I find the likelihoods in these cases ($\mathcal{N}(\mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$)?
$$
\begin{align}
P(y_1|x_1=1) &= \mathcal{N}(1, \sigma^2)(a_1) \qquad \text{(case 1, also cf. eqn. 25.3 in the book)}\\
P(y_1|x_1=1) &= \int_{I_{1}}\mathcal{N}(1, \sigma^2)(y)\,\mathrm{d}y \qquad \text{(case 2, wrong?)}\\
P(y_1|x_1=1) &= \;??? \qquad \text{(case 3)}
\end{align}
$$

Best Answer

Given a channel output $Y_i=y_i$, we compute $P_{X_i|Y_i}(x_i|y_i)$ for $x_i=0$ and $x_i=1$ via Bayes' rule, i.e., $$ P_{X_i|Y_i}(x_i|y_i)= \frac{P_{Y_i|X_i}(y_i|x_i)P_{X_i}(x_i)}{P_{Y_i}(y_i)}. $$ Since $P_{X_i}(0)=P_{X_i}(1)=\frac{1}{2}$ (the common assumption in this context), we have $$P_{X_i|Y_i}(x_i|y_i)=c\cdot P_{Y_i|X_i}(y_i|x_i)$$ for $x_i=0, 1$, where $c$ is some constant. Since $P_{X_i|Y_i}(0|y_i)+P_{X_i|Y_i}(1|y_i)=1$ and the likelihood $P_{Y_i|X_i}(y_i|x_i)$ is specified by the channel (see the last part of my reply for more details), we can find $c$ without computing $P_{Y_i}(y_i)$: explicitly, $$c=(P_{Y_i|X_i}(y_i|0)+P_{Y_i|X_i}(y_i|1))^{-1}.$$ From here, you should be able to see why the posterior probability $P_{X_i|Y_i}(x_i|y_i)$ is called the normalized likelihood. To end the first part, I would like to point out the following fact: $$ \frac{P_{X_i|Y_i}(0|y_i)}{P_{X_i|Y_i}(1|y_i)}=\frac{P_{Y_i|X_i}(y_i|0)}{P_{Y_i|X_i}(y_i|1)}. $$
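
A small numerical illustration (my own numbers, just to make the normalization concrete): suppose the channel gives the raw likelihood pair $P_{Y_i|X_i}(y_i|0)=0.27$ and $P_{Y_i|X_i}(y_i|1)=0.03$, which does not sum to one. Then
$$
c=\frac{1}{0.27+0.03}=\frac{1}{0.30},\qquad
P_{X_i|Y_i}(0|y_i)=\frac{0.27}{0.30}=0.9,\qquad
P_{X_i|Y_i}(1|y_i)=\frac{0.03}{0.30}=0.1,
$$
and the posterior ratio $0.9/0.1=9$ equals the raw likelihood ratio $0.27/0.03$.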

Regarding your second question:

> The author also writes: "from the point of view of decoding, all that matters is the likelihood ratio". While this seems intuitively plausible to me, is there a simple way to see that this is true in general from the formula of Bayes' theorem?

Looking at the decoding problem, our goal is to find a codeword $\mathbf{x}$ in the codebook $\mathcal{C}$ that maximizes $P_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y})$. For simplicity, I consider memoryless binary-input channels. As codewords are equally likely (by assumption), the decoding task for any given channel output $\mathbf{y}$ is $$ \arg\max_{\mathbf{x}\in\mathcal{C}}P_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) =\arg\max_{\mathbf{x}\in\mathcal{C}}P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})=\arg\max_{\mathbf{x}\in\mathcal{C}}\prod_{i=1}^n P_{Y_i|X_i}(y_i|x_i). $$ To solve this optimization problem, we may compare $P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x})$ for every pair of codewords $\mathbf{x}_1$ and $\mathbf{x}_2$: we know $\mathbf{x}_1$ is more likely to have been transmitted if $$ \prod_{i=1}^n\frac{P_{Y_i|X_i}(y_i|x_{1,i})}{P_{Y_i|X_i}(y_i|x_{2,i})}\ge 1. $$ From this comparison, you should see why the decoding relies only on the likelihood ratios.
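
To see this point numerically, here is a sketch of my own (not from the book): rescale each per-bit likelihood pair by an arbitrary positive constant, i.e. keep the per-bit ratios fixed, and both the normalized posteriors over the codebook and the argmax are unchanged, because the constants factor out of every product and cancel in the normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy memoryless setup: per-bit likelihood pairs for an arbitrary 5-bit channel output
# (made-up positive numbers; they need not be normalized in any way).
L = rng.uniform(0.05, 2.0, size=(5, 2))          # L[n, b] = P(y_n | x_n = b)

# Small hypothetical codebook of three 5-bit words.
codebook = np.array([[0, 0, 0, 0, 0],
                     [1, 1, 0, 1, 0],
                     [1, 1, 1, 1, 1]])

def posteriors(L):
    """Posterior over the codebook for a uniform prior: product of per-bit likelihoods, normalized."""
    lik = np.array([np.prod(L[np.arange(L.shape[0]), x]) for x in codebook])
    return lik / lik.sum()

scales = rng.uniform(0.1, 10.0, size=(5, 1))     # one arbitrary positive constant per bit
print(posteriors(L))
print(posteriors(L * scales))                    # identical up to floating-point error
```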

Your last question should no longer be a question. Indeed, the equality $$P_{Y_i|X_i}(y_i|0)+P_{Y_i|X_i}(y_i|1)=1$$ is in general not true, and it is not what is meant here (I know the name "normalized likelihood" might confuse you). So, forget about this equality.

Next, how do we compute the likelihood $P_{Y_i|X_i}(y_i|x_i)$? Well, it is given to you once the channel model is determined since $P_{Y_i|X_i}(y_i|x_i)$ is exactly the channel transition probability.

For a given channel output $Y_i=y_i$ of an AWGN channel where $Y_i=X_i+N_i$ and $N_i\sim\mathcal{N}(0, \sigma^2)$, we have $P_{Y_i|X_i}(y_i|x_i)=\mathcal{N}(x_i, \sigma^2)(y_i)$ for $X_i=x_i$. The other cases need to be specified more precisely before we can discuss them further. I hope the above helps. :)
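
For case 1 of the question, a minimal sketch (assuming, as in this answer, that the channel input is $x_i\in\{0,1\}$ itself; for the $\pm 1$ BPSK mapping in the question one would use $2x_i-1$ as the mean instead):

```python
from scipy.stats import norm

sigma = 0.8
y = 0.3        # hypothetical received value (the a_1 of case 1)

# Channel transition densities P(y | x) = N(x, sigma^2)(y), evaluated at the received value.
lik0 = norm.pdf(y, loc=0.0, scale=sigma)
lik1 = norm.pdf(y, loc=1.0, scale=sigma)

# Normalized likelihoods, i.e. the bitwise posteriors under a uniform prior.
c = 1.0 / (lik0 + lik1)
print(lik0, lik1, c * lik0, c * lik1)
```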
