Shannon entropy of languages

entropy · formal-languages · information-theory · probability · statistics

In his paper "Prediction and Entropy of Printed English", Shannon defines the entropy $H$ of a language as

$$H = \lim_{N \to \infty} F_N$$

where $$F_N = -\sum_{i, j} p(b_i, j) \log p(j \mid b_i)$$

where $b_i$ ranges over blocks of $N-1$ symbols and $j$ over the symbols that can follow such a block.

The interpretation is: $F_N$ is the average surprise of a continuation $j$ of a block $b_i$ (of length $N-1$), with each term weighted by the probability that the sequence $(b_i, j)$ occurs in the first place.
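To make this concrete, here is a minimal Python sketch (my own illustration, not from the paper) that estimates $F_N$ by plugging empirical character $N$-gram frequencies of a text in for the probabilities; the function name `estimate_F_N` and the plug-in estimator are assumptions of the sketch:

```python
from collections import Counter
from math import log2

def estimate_F_N(text, N):
    """Plug-in estimate of Shannon's F_N from character N-gram counts."""
    # Count blocks b_i of length N-1 and full N-grams (b_i, j).
    blocks = Counter(text[k:k + N - 1] for k in range(len(text) - N + 2))
    ngrams = Counter(text[k:k + N] for k in range(len(text) - N + 1))
    total = sum(ngrams.values())

    F_N = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total             # empirical p(b_i, j)
        p_cond = count / blocks[gram[:-1]]  # empirical p(j | b_i)
        F_N -= p_joint * log2(p_cond)       # -p(b_i, j) log2 p(j | b_i)
    return F_N

# Toy usage; a meaningful estimate needs far more text than this.
print(estimate_F_N("the quick brown fox jumps over the lazy dog " * 10, 3))
```

For large $N$ such direct frequency estimates become unreliable, which is part of why Shannon turns to prediction experiments in the paper.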

Shannon further states that

$$F_N = -\sum_{i, j} p(b_i, j) \log p(j \mid b_i) = -\sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_i p(b_i) \log p(b_i)$$

We arrive at that equivalence as follows:

\begin{align}
-\sum_{i, j} p(b_i, j) \log p(j \mid b_i) & = -\sum_{i, j} p(b_i, j) \log \frac{p(b_i, j)}{p(b_i)}\\
& = -\sum_{i, j} p(b_i, j) \left( \log p(b_i, j) - \log p(b_i) \right) \\
& = -\sum_{i, j} \left( p(b_i, j) \log p(b_i, j) - p(b_i, j) \log p(b_i) \right) \\
& = -\left(\sum_{i, j} p(b_i, j) \log p(b_i, j) - \sum_{i, j} p(b_i, j) \log p(b_i)\right) \\
& = -\sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_{i} p(b_i) \log p(b_i)
\end{align}

where the last step uses the marginal distribution $\sum_j p(b_i, j) = p(b_i)$ in the second sum.
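As a quick sanity check of the identity (my own toy example, not from the paper), the two sides can be compared numerically on a small joint distribution:

```python
from math import log2

# Arbitrary toy joint distribution p(b_i, j) over two blocks and two continuations.
p = {("b0", "x"): 0.4, ("b0", "y"): 0.1,
     ("b1", "x"): 0.2, ("b1", "y"): 0.3}

# Marginal p(b_i), obtained by summing over the continuation j.
p_block = {}
for (b, _), q in p.items():
    p_block[b] = p_block.get(b, 0.0) + q

lhs = -sum(q * log2(q / p_block[b]) for (b, _), q in p.items())
rhs = (-sum(q * log2(q) for q in p.values())
       + sum(q * log2(q) for q in p_block.values()))
print(lhs, rhs)  # both come out to roughly 0.8464 bits
```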

Best Answer

My question was about the interpretation of the formula. I answered it myself while writing the question:

The interpretation is: $F_N$ is the average surprise of a continuation $j$ of a block $b_i$ (of length $N-1$), with each term weighted by the probability that the sequence $(b_i, j)$ occurs in the first place.

But since references to this paper are sparse, and the only version I found is a badly scanned copy that is hard to read, maybe this helps somebody.
