Shannon entropy of languages

entropy · formal-languages · information-theory · probability · statistics

In his paper "Prediction and Entropy of Printed English", Shannon defines the entropy $H$ of a language as

$$H = \lim_{N \to \infty} F_N$$

where $$F_N = -\sum_{i, j} p(b_i, j) \log p(j \mid b_i)$$

where $b_i$ ranges over blocks of $N-1$ symbols and $j$ over the symbols that can follow such a block.

The interpretation is: $F_N$ is the average surprise of a continuation $j$ of a block $b_i$ (of length $N-1$), with each term weighted by the probability that the sequence $(b_i, j)$ occurs in the first place.
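To make this concrete, here is a minimal Python sketch (my own illustration, not from the paper) that estimates $F_N$ by plugging empirical character $N$-gram frequencies of a text in for the probabilities; the function name `estimate_F_N` and the plug-in estimator are assumptions of the sketch:

```python
from collections import Counter
from math import log2

def estimate_F_N(text, N):
    """Plug-in estimate of Shannon's F_N from character N-gram counts."""
    # Count blocks b_i of length N-1 and full N-grams (b_i, j).
    blocks = Counter(text[k:k + N - 1] for k in range(len(text) - N + 2))
    ngrams = Counter(text[k:k + N] for k in range(len(text) - N + 1))
    total = sum(ngrams.values())

    F_N = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total             # empirical p(b_i, j)
        p_cond = count / blocks[gram[:-1]]  # empirical p(j | b_i)
        F_N -= p_joint * log2(p_cond)       # -p(b_i, j) log2 p(j | b_i)
    return F_N

# Toy usage; a meaningful estimate needs far more text than this.
print(estimate_F_N("the quick brown fox jumps over the lazy dog " * 10, 3))
```

For large $N$ such direct frequency estimates become unreliable, which is part of why Shannon turns to prediction experiments in the paper.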

Shannon further states that

$$F_N = -\sum_{i, j} p(b_i, j) \log p(j \mid b_i) = -\sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_i p(b_i) \log p(b_i)$$

We arrive at that equivalence as follows:

\begin{align}
-\sum_{i, j} p(b_i, j) \log p(j \mid b_i) & = -\sum_{i, j} p(b_i, j) \log \frac{p(b_i, j)}{p(b_i)}\\
& = -\sum_{i, j} p(b_i, j) \left( \log p(b_i, j) - \log p(b_i) \right) \\
& = -\sum_{i, j} \left( p(b_i, j) \log p(b_i, j) - p(b_i, j) \log p(b_i) \right) \\
& = -\left(\sum_{i, j} p(b_i, j) \log p(b_i, j) - \sum_{i, j} p(b_i, j) \log p(b_i)\right) \\
& = -\sum_{i, j} p(b_i, j) \log p(b_i, j) + \sum_{i} p(b_i) \log p(b_i)
\end{align}

where the last step uses the marginal distribution $\sum_j p(b_i, j) = p(b_i)$ in the second sum.
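As a quick sanity check of the identity (my own toy example, not from the paper), the two sides can be compared numerically on a small joint distribution:

```python
from math import log2

# Arbitrary toy joint distribution p(b_i, j) over two blocks and two continuations.
p = {("b0", "x"): 0.4, ("b0", "y"): 0.1,
     ("b1", "x"): 0.2, ("b1", "y"): 0.3}

# Marginal p(b_i), obtained by summing over the continuation j.
p_block = {}
for (b, _), q in p.items():
    p_block[b] = p_block.get(b, 0.0) + q

lhs = -sum(q * log2(q / p_block[b]) for (b, _), q in p.items())
rhs = (-sum(q * log2(q) for q in p.values())
       + sum(q * log2(q) for q in p_block.values()))
print(lhs, rhs)  # both come out to roughly 0.8464 bits
```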

Best Answer

My question was about the interpretation of the formula. I answered it myself while writing the question:

The interpretation is: $F_N$ is the average surprise of a continuation $j$ of a block $b_i$ (of length $N-1$), with each term weighted by the probability that the sequence $(b_i, j)$ occurs in the first place.

But since references to this paper are sparse, and the only version I found is a badly scanned copy that is hard to read, maybe this helps somebody.
