Shannon Entropy Inequality

Tags: entropy, inequality, information theory

Suppose I observe $n$ data points which generate the empirical distribution $P_n$, with associated point-mass probabilities $p_1,\ldots,p_m$ where $m \le n$. Suppose I now observe a new data point that is different from the first $n$, yielding $P_{n+1}$. I feel reasonably confident that $H(P_n) < H(P_{n+1})$ always holds, but I can neither prove it in complete generality nor find a counterexample. Any ideas? I can't find a property or theorem to leverage for this situation (1.8 from here looked promising at first but doesn't do it).

The question of what happens if the new point isn't unique is answered in generality here.

Best Answer

Let $x$ be the new sample. By the definition of the empirical distribution,

$P_{n+1} = \frac{n}{n+1} P_n + \frac{1}{n+1} \delta_x,$ where $\delta_x$ is the unit mass at $x$.

Since $x \not\in \mathrm{support}(P_n)$, this is a disjoint mixture of two distributions. As an exercise, show that if $P = \lambda Q + (1-\lambda) R$ for $\lambda \in [0,1]$ and distributions $Q,R$ with disjoint support, then $$ H(P) = h_2(\lambda) + \lambda H(Q) + (1-\lambda) H(R),$$ where $h_2$ is the binary entropy function.
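If you want to sanity-check the identity numerically before proving it, here is a minimal sketch (my own addition, not part of the original answer): entropies are in nats, and $Q$, $R$, and $\lambda$ are arbitrary choices with disjoint supports.

```python
# Numerical check of H(P) = h2(lam) + lam*H(Q) + (1-lam)*H(R) for a
# disjoint-support mixture. Q, R, lam are arbitrary illustrative choices.
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def h2(lam):
    """Binary entropy of a Bernoulli(lam) variable, in nats."""
    return entropy([lam, 1.0 - lam])

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(4))   # distribution on symbols {0,1,2,3}
R = rng.dirichlet(np.ones(3))   # distribution on symbols {4,5,6}: disjoint support
lam = 0.7

# The mixture lam*Q + (1-lam)*R, written as one probability vector over 7 symbols.
P = np.concatenate([lam * Q, (1 - lam) * R])

lhs = entropy(P)
rhs = h2(lam) + lam * entropy(Q) + (1 - lam) * entropy(R)
print(lhs, rhs)   # the two numbers agree up to floating-point error
```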

(This is ex. 2.10 in the 2nd edition of Cover and Thomas)

Now, applying this with $Q = P_n$, $R = \delta_x$, and $\lambda = \frac{n}{n+1}$, and noting that $H(\delta_x) = 0$, we have $$ H(P_{n+1}) = h_2\left(\frac{1}{n+1}\right) + \frac{n}{n+1} H(P_n) + 0.$$

Subtracting $\frac{n}{n+1} H(P_n)$ from both sides of the desired inequality $H(P_n) < H(P_{n+1})$, the question becomes whether $$\frac{H(P_n)}{n+1} \overset{?}{<} h_2\left(\frac{1}{n+1}\right).$$
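As a quick numerical illustration of this reduction (again my own sketch; the counts below are a hypothetical example, not data from the question):

```python
# Check that H(P_{n+1}) - H(P_n) equals h2(1/(n+1)) - H(P_n)/(n+1),
# and that it is strictly positive, for an arbitrary set of counts.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

counts = np.array([5, 3, 2])            # hypothetical counts for the first n points
n = counts.sum()                         # n = 10
P_n = counts / n
P_n1 = np.append(counts, 1) / (n + 1)    # one more point with a brand-new symbol

lhs = entropy(P_n1) - entropy(P_n)
rhs = entropy([1 / (n + 1), n / (n + 1)]) - entropy(P_n) / (n + 1)
print(lhs, rhs, lhs > 0)                 # equal up to rounding, and strictly positive
```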

To show this, note first that since the support of $P_n$ has at most $n$ symbols, $H(P_n) \le \log n$. But then $$h_2\left(\frac{1}{n+1}\right) = \frac{\log (n+1)}{n+1} + \frac{n}{n+1}\log \frac{n+1}{n} \ge \frac{\log (n+1)}{n+1} > \frac{\log n}{n+1} \ge \frac{H(P_n)}{n+1}. $$
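If you like, the chain of inequalities can also be checked numerically over a range of $n$ (a sketch using natural logarithms; the range of $n$ is an arbitrary choice):

```python
# Verify h2(1/(n+1)) >= log(n+1)/(n+1) > log(n)/(n+1) for n = 1, ..., 9999.
import numpy as np

def h2(lam):
    """Binary entropy in nats."""
    return -lam * np.log(lam) - (1 - lam) * np.log(1 - lam)

for n in range(1, 10_000):
    assert h2(1 / (n + 1)) >= np.log(n + 1) / (n + 1) > np.log(n) / (n + 1)
```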
