Differential Entropy and “Limiting density of discrete points”


I stumbled across the concept of differential entropy and was left puzzled by the Wikipedia page Limiting density of discrete points.
The related "talk page" Talk: Limiting density of discrete points added to my confusion.

The first point I struggled with is the statement that Shannon's differential entropy is not dimensionally correct, a claim that is disputed on the talk page.

The differential entropy is given by
$$ h(x) = -\int p(x) \log p(x) \,\mathrm{d}x $$
The Wikipedia page argues that the probability density must have dimension $1/\mathrm{d}x$ (so that $p(x)\,\mathrm{d}x$ is a dimensionless probability), which leaves the argument of the logarithm not dimensionless.
This actually made sense to me, since a probability density is in general not dimensionless.

Yet, on the talk page (second link above), the following objection is presented.
The differential entropy is the limit as $\Delta \to 0$ of a Riemann sum
$$ -\sum p(x)\Delta \log \big( p(x) \Delta \big) $$ which, so it is argued, is dimensionally consistent since $p(x)\,\Delta$ is dimensionless.
The latter can be written as
$$-\sum p(x)\Delta \log \big( p(x) \big) -\sum p(x)\Delta \log ( \Delta ) $$
and, so the argument goes, the second term vanishes in the limit while the first term yields the differential entropy formula.

The first question is, who is right in this argument?

I understand that a term such as $\log (A B)$, where $A$ and $B$ are quantities with dimensions of, say, length and inverse length, can be written as $\log(A) + \log(B)$. The arguments of the individual logarithms are not dimensionless, but the whole expression is at least invariant under a change of units. Yet, in the manipulation presented on the talk page, one term is dropped, and the remaining expression is no longer independent of the chosen units.
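To make the invariance concrete (this small example is my own): if the unit of length is rescaled by a factor $c$, the numerical values become $A' = c\,A$ and $B' = B/c$, so
$$ \log(A'B') = \log(AB), \qquad \text{whereas} \qquad \log(A') = \log(A) + \log(c) , $$
i.e. the sum is unchanged while each term on its own shifts with the units.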

The second question I have concerns the "invariant measure" $m(x)$ used by Jaynes to correct the differential entropy formula, leading to the expression
$$ H(x) = -\int p(x) \log \frac{p(x)}{m(x)} \,\mathrm{d}x $$
I understand how this expression is now dimensionally consistent. Yet its consequences seem quite strange to me.
For example, for a uniform distribution with support $[a,b]$, the differential entropy equals $\log(b-a)$: it depends on the support length, which seems meaningful, but it also depends on the system of units chosen to measure that length.
To use Jaynes's equation, I believe a legitimate choice is $$m(x) = \frac{1}{b-a}$$ Then, Jaynes's entropy turns out to equal $0$, regardless of the support length.
My second question, then, is: is this conclusion of mine correct? And would not any constant value do the job, as far as dimensional consistency is concerned, in lieu of $m(x)$?
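Spelling the computation out explicitly: with $p(x) = m(x) = \frac{1}{b-a}$ on $[a,b]$,
$$ H(x) = -\int_a^b \frac{1}{b-a}\,\log\frac{1/(b-a)}{1/(b-a)}\,\mathrm{d}x = -\int_a^b \frac{1}{b-a}\,\log(1)\,\mathrm{d}x = 0 , $$
independently of $a$ and $b$.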

I did read some of Jaynes's original papers, but cannot work it out. Adapting his reasoning, to the best of my understanding, to the uniform-distribution case I mentioned before: he starts from the discrete entropy expression

$$ H_{d} = -\sum_{i} p_i \log(p_i) $$
over the discrete points $x_1, x_2, \dots, x_n$.
Further, by noting that
$$\lim_{n \to \infty} n (x_{i+1}-x_i) = b-a$$
he writes
$$p_i = p(x_i) \frac{1}{n\, m(x_i)} $$ which, for the uniform distribution, I again translate as
$$ p_i = \frac{1}{n} = \frac{1}{b-a} \frac{b-a}{n}$$
which does make sense; yet, once the sum is turned into an integral, it yields the term $\log\frac{p(x)}{m(x)}$, which, as I said, seems to equal zero for any uniform distribution.
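If I spell out the passage to the limit (this reconstruction is mine, so it may be exactly where I go wrong), substituting $p_i = \frac{p(x_i)}{n\,m(x_i)}$ and using $\sum_i p_i = 1$ gives
$$ H_d = -\sum_i p_i \log\frac{p(x_i)}{n\,m(x_i)} = \log n \;-\; \sum_i p_i \log\frac{p(x_i)}{m(x_i)} \;\approx\; \log n \;-\; \int p(x)\,\log\frac{p(x)}{m(x)}\,\mathrm{d}x , $$
so Jaynes's $H$ is what remains of $H_d - \log n$ in the limit, and for the uniform case the surviving integral is again zero.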

I would be most grateful for a clarification. Sorry for the verbosity, but I hope that by spelling out all the steps it will be easier to pinpoint my mistake.

Best Answer

On your first question: The comment on the talk page is wrong. It argues that while the first sum $\sum p(x)\Delta\log p(x)$ becomes a finite integral $\int p(x)\log p(x)\,\mathrm dx$, the second sum $\sum p(x)\Delta\log\Delta$ goes to zero because $\Delta\log\Delta\to0$ as $\Delta\to0$. This is clearly wrong; if the argument were valid, it would also show that the first sum goes to zero. The fact that the infinitesimal contributions necessarily go to zero as we take the limit of a sum to obtain an integral doesn’t imply that the entire sum goes to zero. To the contrary, $\sum p(x)\Delta\log\Delta$ diverges as $\Delta\to0$: each term goes to zero only like $\Delta\log\Delta$, while the number of terms grows like $\Delta^{-1}$, so the sum behaves like $\log\Delta$.
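To see this concretely, here is a minimal numerical sketch (the standard normal density, the grid span and the function name are just arbitrary choices): the first piece of the Riemann sum converges to the differential entropy, while the second diverges like $-\log\Delta$.

```python
import numpy as np

# Illustrative sketch: discretize a standard normal density with cell width
# `delta` and track the two pieces of the Riemann sum from the question
# separately. (The density, the grid span and the names are arbitrary.)
def riemann_entropy_pieces(delta, half_span=10.0):
    x = np.arange(-half_span, half_span, delta)
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    first = -np.sum(p * delta * np.log(p))       # -> differential entropy
    second = -np.sum(p * delta * np.log(delta))  # ~ -log(delta), diverges
    return first, second

for delta in (0.1, 0.01, 0.001):
    first, second = riemann_entropy_pieces(delta)
    print(f"delta={delta}: first = {first:.4f}, second = {second:.4f}")

# The first piece settles near 0.5*log(2*pi*e) ~ 1.4189 nats, while the
# second keeps growing like -log(delta).
```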

On your second question: Your conclusion is right; the reasonable choice $m(x)=\frac1{b-a}$ leads to zero entropy for any length. This is not as strange as it might seem. After all, we’re throwing out an infinite contribution to the entropy. In this context, the absolute value of the entropy is not meaningful; only changes in entropy are meaningful. For instance, if you perform an experiment and your posterior distribution is uniform over the first half of $[a,b]$, with $m(x)$ unchanged, the entropy is now

$$ -\int_a^\frac{a+b}2\frac2{b-a}\log\frac{\frac2{b-a}}{\frac1{b-a}}\mathrm dx=-\log2\;, $$

so you’ve gained one bit of information, which is a reasonable conclusion: for instance, you now need, on average, one fewer yes/no question to pin the value down to a given accuracy. Since you need infinitely many such questions to find the actual value, you’re missing an infinite amount of information, so it makes sense that the “unrenormalized” entropy is infinite. Compare this to the situation in quantum field theory, where infinities in the perturbation series cause unrenormalized energies (e.g. the vacuum energy of the ground state) to be infinite; these infinities are subtracted out, and differences between the remaining finite energies correspond to actual energy differences.
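A minimal numerical sketch of this (the interval endpoints and the function name below are arbitrary) confirms that the result is $-\log 2$ regardless of $a$ and $b$:

```python
import numpy as np

# Illustrative sketch: evaluate -∫ p log(p/m) dx for a posterior that is
# uniform on the first half of [a, b], keeping the reference measure
# m(x) = 1/(b - a). (Interval endpoints and names are arbitrary.)
def half_interval_entropy(a, b, n=100_000):
    dx = (b - a) / 2 / n               # cell width on [a, (a+b)/2]
    p = np.full(n, 2.0 / (b - a))      # uniform density on the first half
    m = 1.0 / (b - a)                  # unchanged reference measure
    return -np.sum(p * np.log(p / m)) * dx

for a, b in [(0.0, 1.0), (3.0, 10.0)]:
    print(a, b, half_interval_entropy(a, b))  # ~ -log 2 ~ -0.6931 in both cases
```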
