There is this question in the Cover and Thomas book "Elements of Information Theory".
Noise alphabets: Consider the channel Y = X + Z where X = {0, 1, 2, 3} and Z is uniformly distributed over three distinct integer values Z = {z1, z2, z3}.
(a) What is the maximum capacity over all choices of the Z alphabet?
Give distinct integer values z1, z2, z3 and distribution on X achieving this.
(b) What is the minimum capacity over all choices for the Z alphabet?
Give distinct integer values z1, z2, z3 and distribution on X achieving this.
The solution given was
(a) Maximum capacity is C = 2 bits. Z = {10, 20, 30} and p(X) = (1/4, 1/4, 1/4, 1/4)
(b) Minimum capacity is C = 1 bit. Z = {0, 1, 2} and p(X) = (1/2, 0, 0, 1/2)
I don't quite understand how to arrive at the solution.
My thoughts for this question are as follows.
The maximum capacity is $C = \max_{p_X} I(X + Z; Y)$. Therefore $C = H(Y) - H(Y \mid X+Z)$, and $H(Y \mid X+Z) = 0$. So I should maximize $H(Y)$, but I'm not sure how to do that.
Alternatively, $C = \max I(X; Y-Z) = H(X) - H(X \mid Y-Z)$. Similarly, $H(X \mid Y-Z) = 0$, so $\max H(X) = \log 4 = 2$, so the maximum capacity is 2 bits. And this can be achieved when $p_X$ is the uniform distribution, i.e. $p_X = (1/4, 1/4, 1/4, 1/4)$. Is my understanding correct?
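If it helps to sanity-check the 2-bit claim numerically, here is a small sketch (the helper `mutual_information` is my own, not from the book) that computes $I(X;Y)$ for $Y = X + Z$ directly from the joint distribution. With $Z = \{10, 20, 30\}$ all twelve sums $x + z$ are distinct, so $Y$ determines $X$ and the uniform input attains $H(X) = 2$ bits.

```python
from itertools import product
from math import log2

def mutual_information(p_x, z_alphabet):
    """I(X;Y) in bits for Y = X + Z, with X on {0,...,len(p_x)-1}
    and Z independent of X, uniform over z_alphabet."""
    xs = range(len(p_x))
    p_z = 1 / len(z_alphabet)
    p_y, p_xy = {}, {}
    for x, z in product(xs, z_alphabet):
        y = x + z
        p_xy[(x, y)] = p_xy.get((x, y), 0) + p_x[x] * p_z
        p_y[y] = p_y.get(y, 0) + p_x[x] * p_z
    # I(X;Y) = sum over (x,y) of p(x,y) log2( p(x,y) / (p(x) p(y)) )
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Spread-out noise: all sums x + z distinct, so Y reveals X exactly.
print(mutual_information([0.25] * 4, [10, 20, 30]))  # ≈ 2.0
```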
I don't understand why those values of Z were chosen, and how the minimum capacity is obtained. For example, why can't the capacity be 0 bits? I know it makes no sense to have a channel and not send any bits, but isn't 0 a valid minimum capacity?
Best Answer
Answering your questions on the minimisation:
We can argue from the structure of the problem that the alphabet of $Z$ must be three consecutive integers $(a, a+1, a+2)$ for some integer $a$: any gap between the noise values spreads $Y$ over more distinct outputs, which can only increase the capacity (ask me about this if it is unclear). Set $a = 0$ for convenience, since the decoder knows $a$ and can always subtract it.
Now, given this noise, we need to find the distribution $p_X$ that maximises $I(X;Y)$. I'll try to keep things fully analytic, no jumps of logic, no inspired guesswork.
$I(X;Y) = H(Y) - H(Y|X) = H(Y) - H(X+Z|X)$
$ = H(Y) - H(Z) = H(Y) - \log_2 3$
So we're looking for the distribution of $X$, $(p_0, p_1, p_2, p_3)$, that maximises $H(Y)$. There's a dirty old analytic way to do this, using Lagrange multipliers. (Note: I'll be doing this in natural logs, because they don't leave a dirty constant. This doesn't change the maximiser.)
Now, the distribution of $Y$ in terms of that of $X$ is as follows (convince yourself of this):
$\left(p_0/3, (p_0 + p_1)/3, (p_0+p_1+p_2)/3, (p_1+p_2+p_3)/3, (p_2+p_3)/3, p_3/3 \right)$
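As a quick check on that list, here is a small convolution sketch (function name mine) that computes $P(Y=y)$ for an arbitrary input pmf with $Z$ uniform on $\{0,1,2\}$:

```python
def y_distribution(p):
    """P(Y = y) for y = 0..5, given input pmf p = (p0, p1, p2, p3)
    and Z uniform on {0, 1, 2}; Y = X + Z with X, Z independent."""
    return [sum(p[y - z] for z in (0, 1, 2) if 0 <= y - z <= 3) / 3
            for y in range(6)]

# Matches the listed expression, e.g. P(Y=3) = (p1 + p2 + p3)/3:
print(y_distribution([0.1, 0.2, 0.3, 0.4]))
```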
We want to run the optimisation: $$ \max_{(p_0, p_1, p_2, p_3)} \sum_{y=0}^5 -q_y\ln q_y$$ $$\text{subject to } p_0 + p_1 + p_2 + p_3 = 1,$$ where $q_y$ denotes the output probability $P(Y=y)$ listed above (written $q$ rather than $p$ to avoid clashing with the input probabilities).
Define $J = \sum_{y=0}^5 -q_y\ln q_y + \lambda (p_0 + p_1 + p_2 + p_3 - 1)$
At the optimum:
$$\partial_{p_0} (J) = 0 \implies \log \left( (p_0)(p_1+p_0)(p_2+p_1+p_0) \right) = 3\lambda - 3 \tag{1}$$
$$\partial_{p_1} (J) = 0 \implies \log \left((p_1+p_0)(p_2+p_1+p_0)(p_3+p_2+p_1)\right) = 3\lambda - 3 \tag{2}$$
$$\partial_{p_2} (J) = 0 \implies \log \left((p_2+p_1+p_0)(p_3+p_2+p_1)(p_3+p_2)\right) = 3\lambda - 3 \tag{3}$$
$$\partial_{p_3} (J) = 0 \implies \log \left((p_3)(p_3+p_2)(p_3+p_2+p_1)\right) = 3\lambda - 3 \tag{4}$$
$$\partial_{\lambda}(J) = 0 \implies p_0 + p_1 + p_2 + p_3 = 1 \tag{5}$$
This is not so bad to calculate. Look at all the terms containing $p_0$: when a term like $(p_0+p_1)\log(p_0+p_1)$ is differentiated with respect to $p_0$, differentiating the log leaves you with a $1$, and differentiating the outer $(p_0+p_1)$ leaves you with just the log.
Now, $(1) - (2) = 0$. This implies:
$$\log \frac{p_0}{p_1 + p_2 + p_3} = 0 \iff p_0 = p_1 + p_2 + p_3 \tag{$\alpha$}$$
$(4) - (3) = 0$. This implies:
$$\log \frac{p_3}{p_0 + p_1 + p_2 } = 0 \iff p_3 = p_0 + p_1 + p_2 \tag{$\beta$}$$
Adding $(\alpha)$ and $(\beta)$ gives $p_0 + p_3 = p_0 + p_3 + 2(p_1 + p_2)$, so $p_1 = p_2 = 0$ (probabilities are non-negative). Feeding this into equation $(5)$ gives $p_0 = p_3 = 0.5$.
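As a sanity check (mine, not part of the derivation), you can plug $p = (1/2, 0, 0, 1/2)$ back into the stationarity conditions: the four products inside the logs of $(1)$-$(4)$ all come out equal, as required.

```python
# Plug the claimed optimum into the four stationarity products.
p0, p1, p2, p3 = 0.5, 0.0, 0.0, 0.5
products = [
    p0 * (p0 + p1) * (p0 + p1 + p2),              # equation (1)
    (p1 + p0) * (p2 + p1 + p0) * (p3 + p2 + p1),  # equation (2)
    (p2 + p1 + p0) * (p3 + p2 + p1) * (p3 + p2),  # equation (3)
    p3 * (p3 + p2) * (p3 + p2 + p1),              # equation (4)
]
print(products)  # all four equal 0.125
```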
At this, the distribution of $Y$ is uniform over 6 characters.
Therefore, $C = \max\limits_{p_X} I(X;Y) = \log_2 6 - \log _2 3 = 1$. Phew.
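Numerically, the sketch below (helper names mine) confirms the value: with $p_X = (1/2, 0, 0, 1/2)$ the output is uniform on six symbols, and $H(Y) - H(Z) = \log_2 6 - \log_2 3 = 1$ bit.

```python
from math import log2

def entropy(dist):
    # Shannon entropy in bits
    return -sum(q * log2(q) for q in dist if q > 0)

p = [0.5, 0.0, 0.0, 0.5]                       # input distribution
p_y = [sum(p[y - z] for z in (0, 1, 2) if 0 <= y - z <= 3) / 3
       for y in range(6)]                      # P(Y = y), Z uniform on {0,1,2}
print(p_y)                                     # six entries of 1/6
print(entropy(p_y) - log2(3))                  # I(X;Y) = 1.0 bit
```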
Try to see what this is doing structurally. Suppose $X$ were uniform; then under this noise, I would be able to tell when the input was $0$ or $3$ with much greater accuracy than when it was $1$ or $2$. Why? Because whenever $Y = 0$ I know that the input is $0$, and similarly $Y = 5$ forces $X = 3$. So a 'good' distribution for communication under this noise should be weighted more towards $0$ and $3$.
To answer the other questions you raised:
Capacity can be zero, sure. For example, take $Z$ to be uniform over $\{ 1, 2, \dots ,N\}$. Then the capacity heads to $0$ as $N \to \infty$.
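One way to see this (a bound I'm adding, not from the argument above): $Y = X + Z$ takes at most $N + 3$ values, so $C = \max_{p_X} H(Y) - H(Z) \le \log_2(N+3) - \log_2 N$, which vanishes as $N$ grows.

```python
from math import log2

# Upper bound on capacity for Z uniform over {1, ..., N}, X in {0, 1, 2, 3}:
# Y takes at most N + 3 values, so C <= log2(N + 3) - log2(N).
for N in (3, 10, 100, 1000):
    print(N, log2(N + 3) - log2(N))
```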
What you're trying to do here is this: You know that your noise is uniform over 3 letters. You don't know which letters. Suppose the noise was the worst possible. At what rate would I still be able to communicate? This is the minimum capacity. It's the smallest of the maximum rates at which I could talk subject to varying noise.
(From your comments) You've calculated $H(Y|X)$ incorrectly.
$H(Y|X) = H(X+Z|X) = H(Z|X) = H(Z)$
Where the second equality comes from the fact that if I'm given $X$, I can always subtract it from any other random variable with 100% accuracy, and the last equality because $Z$ and $X$ are independent.