The meaning of the optimal input probability distribution for a noisy channel in information theory

information theory

While trying to understand Shannon's information theory, I never quite managed to get an intuitive feeling for the "optimal input ensemble probability distribution" $p^*_x$, which is the distribution that maximizes the mutual information of a channel and gives its capacity

$C = \max_{p_x} I(X;Y)$.
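(As an aside, the standard numerical way to carry out this maximization is the Blahut–Arimoto iteration; below is a minimal sketch of it, assuming numpy, with function and variable names of my own.)

```python
import numpy as np

def blahut_arimoto(W, iters=500):
    """Numerically maximize I(X;Y) over p_x for a channel matrix W[x, y] = p(y|x).

    Returns (capacity in bits, capacity-achieving input distribution)."""
    n_in = W.shape[0]
    p = np.full(n_in, 1.0 / n_in)              # start from the uniform input
    for _ in range(iters):
        q = p[:, None] * W                     # p(x) p(y|x)
        q /= q.sum(axis=0, keepdims=True)      # posterior q(x|y)
        with np.errstate(divide="ignore", invalid="ignore"):
            # r(x) = exp2( sum_y p(y|x) log2 q(x|y) ); terms with p(y|x) = 0 vanish
            r = np.exp2(np.sum(np.where(W > 0, W * np.log2(q), 0.0), axis=1))
        p = r / r.sum()                        # updated input distribution
    p_y = p @ W
    with np.errstate(divide="ignore", invalid="ignore"):
        I = np.sum(np.where(W > 0, p[:, None] * W * np.log2(W / p_y), 0.0))
    return I, p
```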

Specifically, I'm not sure if $p^*_x$ provides any insight as to how to construct a code which allows us to communicate efficiently through a channel.

Take for example the "noisy typewriter", through which you are able to send all 26 English letters, but for each letter you send, the channel may output either that letter or the next one, so A can come out as A or B, M as M or N, etc. If you calculate the capacity of such a channel using standard methods, you get $C=\log_2 13$ bits, with $p^*_x$ a uniform distribution over the input alphabet. However, for this channel it's easy to devise an optimal coding strategy: just use every other letter, A, C, E…, because their possible outputs are fully distinguishable from each other. Since there are now 13 letters you can send with perfect decodability, you can transmit $\log_2 13$ bits per channel use, so this strategy achieves the previously calculated capacity.

It turns out that a uniform distribution over every other letter of the alphabet, call it $p^{*'}_x$ ($p_A = 1/13$, $p_B = 0$, $p_C = 1/13$…), also achieves capacity. So now we're left with a conundrum: both $p^*_x$ and $p^{*'}_x$ achieve capacity, but the latter gives an understanding of how the channel can be used effectively, while the former doesn't really say much at all (certainly interpreting it as "use all letters equally often" doesn't lead to an efficient coding strategy).
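To make this concrete, here is a quick numerical check (assuming numpy; the variable names are mine) that both distributions give the same mutual information:

```python
import numpy as np

n = 26
W = np.zeros((n, n))                      # W[x, y] = p(y | x)
for x in range(n):
    W[x, x] = W[x, (x + 1) % n] = 0.5     # each letter comes out as itself or the next

def I_bits(p_x):
    """I(X;Y) in bits; H(Y|X) is exactly 1 bit for every input letter."""
    p_y = p_x @ W
    H_Y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    return H_Y - 1.0

p_star = np.full(n, 1 / n)                # p*_x: uniform over all 26 letters
p_star_prime = np.array([1 / 13 if x % 2 == 0 else 0.0 for x in range(n)])  # p*'_x

print(I_bits(p_star), I_bits(p_star_prime), np.log2(13))   # all ≈ 3.7004
```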

How, then, should one interpret the capacity-achieving distributions of other channels? And why does this methodology of assigning probabilities to input symbols and finding the optimal ones give us any clue as to how we should go about sending a message of our own through a noisy channel?

Best Answer

If you calculate the capacity of such a channel using standard methods, you get $C=\log_2 13$ bits, with $p^*_x$ a uniform distribution over the input alphabet

Not quite. What you get is $I(X;Y)=H(Y)-H(Y|X)=H(Y)-1$ bits, which is maximized by a uniform output distribution $p_y$, with $H(Y)=\log_2 26$ bits, so that $C = \log_2 26 - 1 = \log_2 13$ bits. Then comes the realization that a uniform $p_y=1/26$ can be obtained from a uniform input $p_x=1/26$, but this is not the only solution: you could alternate the probabilities, $P_x(A)=1/13$, $P_x(B)=0$, $P_x(C)=1/13$, etc.
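As a small illustration of this non-uniqueness (the parametrization below is mine, and it assumes numpy): every input of the form $p_x(x) = 1/26 + (-1)^x\, t$ with $0 \le t \le 1/26$ produces the uniform output, and therefore attains the capacity.

```python
import numpy as np

n = 26
W = np.zeros((n, n))
for x in range(n):
    W[x, x] = W[x, (x + 1) % n] = 0.5

# t = 0 is the uniform input; t = 1/26 is the "every other letter" input.
for t in (0.0, 0.02, 1 / 26):
    p_x = np.array([1 / n + (-1) ** x * t for x in range(n)])
    # The +t and -t terms cancel in each output (26 is even), so p_y is uniform:
    assert np.allclose(p_x @ W, 1 / n)
    # hence I(X;Y) = H(Y) - H(Y|X) = log2(26) - 1 = log2(13) bits for every t
```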

Hence the noisy typewriter has several input distributions that attain the capacity (not a very frequent situation). All of them are optimal in the Shannon sense, but it happens that one of those alternatives leads to a simple (almost trivial) optimal encoding (again, a rather rare occurrence).
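To spell out that trivial encoding as code (a minimal sketch; the function and variable names are illustrative): thirteen messages are sent as every other letter, and every possible channel output decodes unambiguously.

```python
CODEBOOK = [chr(ord('A') + 2 * m) for m in range(13)]   # A, C, E, ..., Y

def encode(message: int) -> str:
    return CODEBOOK[message]

def decode(received: str) -> int:
    # Codeword 2m can only come out of the channel as letter 2m or 2m + 1,
    # so integer division by 2 recovers the message with zero error.
    return (ord(received) - ord('A')) // 2

# Both the clean and the shifted channel output decode correctly:
assert all(decode(chr(ord('A') + 2 * m)) == m for m in range(13))
assert all(decode(chr(ord('A') + 2 * m + 1)) == m for m in range(13))
# Each channel use thus carries log2(13) ≈ 3.70 bits with zero error.
```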
