What is Empirical Entropy in Information Theory?

entropy, information-theory

In the definition of jointly typical sets (in "Elements of Information Theory", ch. 7.6, p. 195), we use

$$-\frac{1}{n} \log{p(x^n)}$$ as the empirical entropy of an $n$-sequence, with $p(x^n) = \prod_{i=1}^{n}{p(x_i)}$. I have never come across this terminology before, and according to the book's index it is not explicitly defined anywhere.

My question basically is: Why is empirical entropy not $-\sum_{x}{\hat p (x) \log(\hat p(x))}$ where $\hat p(x)$ is the empirical distribution?

What are the most interesting differences and similarities between these two formulas, in terms of the properties they share or do not share?
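To make the comparison concrete, here is a minimal sketch (not from the book; the toy sequence and model distribution are made up) that evaluates both expressions on the same data:

```python
import math
from collections import Counter

x = list("aabaabba")              # a made-up observed n-sequence x^n
p = {"a": 0.5, "b": 0.5}          # a made-up model distribution p
n = len(x)

# -(1/n) log p(x^n) with p(x^n) = prod_i p(x_i), evaluated under the model p
per_symbol_log_p = -sum(math.log2(p[xi]) for xi in x) / n

# plug-in entropy of the empirical distribution \hat p (relative frequencies)
counts = Counter(x)
phat = {a: c / n for a, c in counts.items()}
H_phat = -sum(q * math.log2(q) for q in phat.values())

print(per_symbol_log_p)   # 1.0   -- depends on the assumed p
print(H_phat)             # ~0.954 -- depends only on the observed frequencies
```

The two numbers differ here because the model $p$ and the empirical distribution $\hat p$ differ; the answer below makes the relationship precise.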

Best Answer

If the data is $x^n = x_1 \ldots x_n$, that is, an $n$-sequence from a sample space $\mathcal{X}$, the empirical point probabilities are
$$\hat{p}(x) = \frac{1}{n}|\{ i \mid x_i = x\}| = \frac{1}{n} \sum_{i=1}^n \delta_x(x_i)$$
for $x \in \mathcal{X}$. Here $\delta_x(x_i)$ is one if $x_i = x$ and zero otherwise. That is, $\hat{p}(x)$ is the relative frequency of $x$ in the observed sequence.

The entropy of the probability distribution given by the empirical point probabilities is
$$H(\hat{p}) = - \sum_{x \in \mathcal{X}} \hat{p}(x) \log \hat{p}(x) = - \sum_{x \in \mathcal{X}} \frac{1}{n} \sum_{i=1}^n \delta_x(x_i) \log \hat{p}(x) = -\frac{1}{n} \sum_{i=1}^n \log\hat{p}(x_i).$$
The latter identity follows by interchanging the two sums and noting that
$$\sum_{x \in \mathcal{X}} \delta_x(x_i) \log\hat{p}(x) = \log\hat{p}(x_i).$$

From this we see that
$$H(\hat{p}) = - \frac{1}{n} \log \hat{p}(x^n)$$
with $\hat{p}(x^n) = \prod_{i=1}^n \hat{p}(x_i)$, and, using the terminology from the question, this is the empirical entropy of the empirical probability distribution. As pointed out by @cardinal in a comment, $- \frac{1}{n} \log p(x^n)$ is the empirical entropy of a given probability distribution with point probabilities $p$.
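A minimal numerical check of the identity above (the toy sequence is made up; any finite sequence would do):

```python
import math
from collections import Counter

x = list("aabaabba")              # a made-up observed n-sequence x^n
n = len(x)
counts = Counter(x)
phat = {a: c / n for a, c in counts.items()}   # empirical point probabilities

# left-hand side: plug-in entropy H(\hat p) of the empirical distribution
lhs = -sum(q * math.log2(q) for q in phat.values())

# right-hand side: -(1/n) log \hat p(x^n) with \hat p(x^n) = prod_i \hat p(x_i)
rhs = -sum(math.log2(phat[xi]) for xi in x) / n

print(abs(lhs - rhs) < 1e-12)     # True: the two expressions agree
```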
