Differential entropy vs Kolmogorov-Sinai “partition trick”

entropy, information theory, measure theory

Shannon entropy is well-defined for probability distributions $p(x)$ on finite (or countable) sets $X$,
\begin{equation}
H_S=-\sum_{x\in X}p(x)\log p(x)\,.
\end{equation}
To compute the entropy of a real-valued variable, many people use differential entropy
\begin{equation}
H_{D}=-\int_{X}p(x)\log p(x)dx\,,
\end{equation}
but this definition of entropy is not the continuous analog of $H_S$ and does not retain $H_S$'s non-negativity.
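For example, for the uniform distribution on $[0,a]$,
\begin{equation}
H_D=-\int_0^a\frac{1}{a}\log\frac{1}{a}\,dx=\log a\,,
\end{equation}
which is negative whenever $a<1$, while the Shannon entropy of any discrete distribution is non-negative.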

Correctly extending $H_S$ to the case of uncountable $X$ is not that straightforward and requires the use of a limiting density of discrete points, which seems tricky enough that many people ignore this problem and just use differential entropy despite its shortcomings.
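Roughly, Jaynes' construction discretizes $X$ into $N$ points with limiting density $m(x)$ and keeps only the part of the discrete entropy $H_S^{(N)}$ that stays finite as $N\rightarrow\infty$,
\begin{equation}
\lim_{N\rightarrow\infty}\bigl(H_S^{(N)}-\log N\bigr)=-\int_X p(x)\log\frac{p(x)}{m(x)}\,dx\,,
\end{equation}
which, unlike $H_D$, is invariant under changes of variables.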

Kolmogorov and Sinai defined entropy of a dynamical system. Let $(X,\mathcal B, \mu, T)$ be a dynamical system with probability triple $(X,\mathcal B, \mu)$ and a map $T:X\rightarrow X$ that describes the dynamics on $X$. Its Kolmogorov-Sinai entropy is
\begin{equation}
H_{KSE}=\sup_Q h(T,Q)
\end{equation}
where
\begin{equation}
h(T,Q)=\lim_{N\rightarrow\infty}\frac{1}{N}H_S(\bigvee_{n=0}^NT^{-n}Q),
\end{equation}
is the entropy of a dynamical system with respect to a partition $Q=\{Q_1,Q_2,\dots,Q_k\}$ of $X$ and
\begin{equation}
\bigvee_{n=0}^NT^{-n}Q=\{Q_{i_0}\cap T^{-1}Q_{i_1}\cap T^{-2}Q_{i_2}\cap \dots\cap T^{-N}Q_{i_N}\}_{i_0,i_1,\dots,i_N\leq k}
\end{equation}
is the join of the iterated pullbacks of $Q$ under $T$; its elements correspond to the distinct trajectories that $T$ can generate on $Q$ in $N$ time steps.

The set $\bigvee_{n=0}^NT^{-n}Q$ of all trajectories is countable because the partition $Q$ is countable. So Kolmogorov and Sinai circumvented the trouble of having to consider an uncountable number of states (at every time point) by partitioning $X$ into a countable number of sets $Q_i$. By taking the supremum over partitions $Q$, they ensured that one chooses a partition of $X$ that is fine enough to capture all the information that the dynamical system generates at each time step.
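As a concrete sanity check, here is a minimal numerical sketch, assuming the doubling map $T(x)=2x\bmod 1$ on $[0,1)$ with Lebesgue measure and the partition $Q=\{[0,\tfrac12),[\tfrac12,1)\}$, whose Kolmogorov-Sinai entropy is known to be $\log 2$. It estimates the entropy rate of length-$N$ symbol blocks, which converges to $h(T,Q)$:

```python
# Minimal sketch: Monte Carlo estimate of h(T, Q) for the doubling map
# T(x) = 2x mod 1 with the partition Q = {[0, 1/2), [1/2, 1)}.
# Lebesgue measure is T-invariant and the true KS entropy is log 2.
import numpy as np

rng = np.random.default_rng(0)

def block_entropy_rate(N, n_samples=200_000):
    """(1/N) * Shannon entropy of length-N symbol blocks, estimated from
    the empirical distribution of trajectories started from uniform samples."""
    x = rng.random(n_samples)                  # samples from the invariant measure
    codes = np.zeros(n_samples, dtype=np.int64)
    for _ in range(N):
        symbols = (x >= 0.5).astype(np.int64)  # which atom of Q each point is in
        codes = 2 * codes + symbols            # encode the visited atoms as an integer
        x = (2.0 * x) % 1.0                    # apply the doubling map
    _, counts = np.unique(codes, return_counts=True)
    p = counts / n_samples
    return -np.sum(p * np.log(p)) / N

for N in (4, 8, 12):
    print(N, block_entropy_rate(N))            # approaches log 2 ~= 0.693
```

The estimates approach $\log 2\approx 0.693$ from below; the slight downward bias comes from estimating block probabilities from finitely many samples.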

Question: Can we define Shannon entropy on uncountable sets $X$ (for example $X=\mathbb R$) as
\begin{equation}
H_S'=\sup_Q\left(-\sum_{Q_i\in Q}p(Q_i)\log p(Q_i)\right)\,,
\end{equation}
where $Q$ is a partition of $X$? If so, why doesn't anybody do that? Or is this in some way equivalent to Jaynes' approach using limiting density of discrete points? Or to differential entropy?

Best Answer

Kolmogorov defined the notion of $\epsilon$-entropy for abstract metric spaces, as a general way of defining entropy in the case of continuous signals, in On the Shannon Theory of Information Transmission. The definition there depends on the notion of mutual information.

A better presentation can be found in Rényi's work on the dimension and entropy of probability distributions.

An $\epsilon$-partition of a metric space $\Omega$ is a finite collection of disjoint sets $\{E_i\}$, each of diameter less than $\epsilon$, covering the whole space. The minimal number of sets in such a partition is denoted $N(\epsilon)$. For an $\epsilon$-partition whose sets belong to the $\sigma$-algebra and whose number of sets $n(\epsilon)$ satisfies $$ \lim_{\epsilon \to 0}\frac{\log n(\epsilon)}{\log N(\epsilon)}=1, $$ define the $\epsilon$-entropy of a measure $\mu$ with respect to the partition $\{E_i\}$ as $$ H_\epsilon(\mu;\{E_i\})=\sum_i \mu(E_i)\log\frac{1}{\mu(E_i)}. $$
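For instance, for Lebesgue measure on $[0,1]$ and the partition into $\lceil 1/\epsilon\rceil$ intervals of equal length, each set has measure roughly $\epsilon$ and $$ H_\epsilon(\mu;\{E_i\})\approx\log\frac{1}{\epsilon}. $$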

As $\epsilon\to 0$, this entropy can blow up, as in the continuous case. For example, for real-valued continuous random variables we have $$ \lim_{\epsilon\to 0}\frac{H_\epsilon(\mu;\{E_i\})}{\log\frac 1\epsilon}=1. $$ The above limit, if it exists, is a certain notion of dimension, call it the information dimension. For example, the information dimension of discrete random variables is zero, and that of continuous random vectors on $\mathbb R^n$ is equal to $n$. This gets trickier for singular random variables, on fractals for instance.
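A standard singular example is the Cantor measure (i.i.d. fair-coin ternary digits, supported on the middle-thirds Cantor set): at scale $\epsilon=3^{-m}$ its support is covered by $2^m$ intervals of mass $2^{-m}$ each, so $H_\epsilon=m\log 2$ and $$ \lim_{\epsilon\to 0}\frac{H_\epsilon(\mu;\{E_i\})}{\log\frac 1\epsilon}=\frac{\log 2}{\log 3}\approx 0.63, $$ a non-integer information dimension.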

Now, roughly speaking, if the information dimension is $d$, then the $\epsilon$-entropy scales as $d\log\frac 1\epsilon$ plus a remainder. This remainder can in general depend on the partition used.

However, if one considers real-valued random variables and chooses intervals for partitioning, then the remainder is, interestingly, the Shannon entropy for discrete random variables and the differential entropy for continuous random variables. So, for an appropriately chosen partition, we have $$ H(\mu)=\lim_{\epsilon\to 0}\left(H_\epsilon(\mu;\{E_i\})-d\log\frac 1\epsilon\right), $$ where $$ \lim_{\epsilon\to 0}\frac{H_\epsilon(\mu;\{E_i\})}{\log\frac 1\epsilon}=d. $$ This holds for singular distributions as well.
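As a quick numerical illustration, here is a sketch assuming a standard Gaussian, so that $d=1$ and the differential entropy is $\frac12\log(2\pi e)\approx 1.4189$, with interval partitions of width $\epsilon$:

```python
# Minimal sketch: for a standard Gaussian (d = 1), H_eps for an interval
# partition of width eps, minus log(1/eps), should approach the differential
# entropy 0.5 * log(2*pi*e) ~= 1.4189.
import numpy as np
from scipy.stats import norm

def eps_entropy(eps, lo=-10.0, hi=10.0):
    """Epsilon-entropy of the standard normal measure for the partition of
    [lo, hi] into intervals of length eps (mass outside [lo, hi] is negligible)."""
    edges = np.arange(lo, hi + eps, eps)
    p = np.diff(norm.cdf(edges))
    p = p[p > 0]                      # drop empty cells before taking logs
    return -np.sum(p * np.log(p))

for eps in (0.1, 0.01, 0.001):
    print(eps, eps_entropy(eps) - np.log(1.0 / eps))
# the printed values approach 0.5 * np.log(2 * np.pi * np.e) ~= 1.4189
```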
