There have been several attempts to generalize the notion of entropy from discrete random variables to random variables with a general distribution function.
A straightforward way is to quantize the real line into intervals and approximate the density by a Riemann sum: the continuous random variable is replaced by a discrete one supported on those intervals, and the intervals are then made small. Denote the quantized random variable by $X_\delta$, where $\delta$ is the width of the intervals. If the probability density function $f$ is Riemann integrable, then for small $\delta$ (Cover and Thomas, p. 248):
$$
H(X_\delta)\approx h(X)-\log\delta.
$$
By choosing $\delta=2^{-n}$, i.e., $n$-bit quantization, we get
$$
H(X_\delta)\approx h(X)+n
$$
which represents how many bits we need to describe $X$ to $n$-bit accuracy.
This gives some sense of the relation between differential entropy and discrete entropy. Note that as $\delta\to 0$, $H(X_\delta)\to\infty$.
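To make this concrete, here is a minimal Python sketch (the standard normal density, the bin width $\delta=0.01$, and the use of natural logarithms are my own illustrative choices): it quantizes $X\sim N(0,1)$ into bins of width $\delta$ and checks that $H(X_\delta)\approx h(X)-\log\delta$.

```python
import numpy as np

# Quantize a standard normal X into bins of width delta and compare
# H(X_delta) with h(X) - log(delta), using natural logs (nats).
delta = 0.01
edges = np.arange(-10, 10 + delta, delta)           # covers essentially all of the mass
centers = (edges[:-1] + edges[1:]) / 2

# Bin probabilities p_k ~ f(x_k) * delta (Riemann-sum approximation of the density)
f = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
p = f * delta
p /= p.sum()                                        # absorb the tiny truncation error

H_quantized = -np.sum(p[p > 0] * np.log(p[p > 0]))  # discrete entropy H(X_delta)
h_differential = 0.5 * np.log(2 * np.pi * np.e)     # h(X) for N(0,1)

print(H_quantized)                                  # ~ 6.02 nats
print(h_differential - np.log(delta))               # ~ 6.02 nats
```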
Another point is that mutual information is preserved under this quantization as $\delta\to 0$:
$$
I(X_\delta;Y_\delta)=I(X;Y).
$$
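A rough numerical check of this, again only a sketch with my own choice of a bivariate normal pair, sample size, and bin width: quantize both coordinates and compare the plug-in mutual information of the joint histogram with the closed form $-\frac12\log(1-\rho^2)$.

```python
import numpy as np

# Check that quantizing X and Y (approximately) preserves mutual information.
# For a bivariate normal with correlation rho, I(X;Y) = -0.5 * log(1 - rho^2).
rng = np.random.default_rng(0)
rho, n, delta = 0.8, 2_000_000, 0.05

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Quantize to bins of width delta and build the joint histogram
edges = np.arange(-6, 6 + delta, delta)
pxy, _, _ = np.histogram2d(x, y, bins=[edges, edges])
pxy /= pxy.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)

mask = pxy > 0
I_quantized = np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))
I_exact = -0.5 * np.log(1 - rho**2)

print(I_quantized, I_exact)   # both close to 0.51 nats
```

The plug-in estimate has a small positive bias that shrinks with the sample size, so the two numbers agree only approximately.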
The generalization is attributed to several people, chiefly Kolmogorov and Rényi:
- A. N. Kolmogorov, "On the Shannon theory of information transmission in the case of continuous signals," IRE Trans. Inf. Theory, IT-2:102–108, Sept. 1956.
- J. Balatoni and A. Rényi, "Remarks on entropy" (in Hungarian with English and Russian summaries), Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 1 (1956), pp. 9–40.
Rényi introduced the following random variable ($[\cdot]$ denotes the integer part):
$$
X_n=\frac1{n}[nX].
$$
Note that this is nothing but quantizing $X$ to the intervals $[\frac kn,\frac{k+1}n)$.
Suppose that $H([X])$ exists; it is denoted by $H_0(X)$ in the original paper. The lower dimension of $X$ is defined as follows:
$$
\underline d(X)=\liminf_{n\to\infty}\frac{H(X_n)}{\log n}
$$
and the upper dimension of $X$ as:
$$
\overline d(X)=\limsup_{n\to\infty}\frac{H(X_n)}{\log n}.
$$
Now if $\overline d(X)=\underline d(X)$, we call the common value the information dimension of $X$, denoted $d(X)$, and we define the $d(X)$-dimensional entropy:
$$
H_{d(X)}(X)=\lim_{n\to\infty} (H(X_n)-d(X)\log n).
$$
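Purely as an illustration (the uniform and the four-point distributions below are my own choices, and the plug-in entropy estimate is crude), one can watch $H(X_n)/\log n$ approach $1$ for a continuous variable and decay toward $0$ for a discrete one:

```python
import numpy as np

# Crude numerical illustration of the information dimension d(X) = lim H(X_n)/log(n),
# with X_n = [nX]/n, for one continuous and one discrete example.
def quantized_entropy(samples, n):
    """Plug-in estimate of H(X_n) from samples, in nats."""
    vals, counts = np.unique(np.floor(n * samples), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
continuous = rng.random(1_000_000)              # X ~ Uniform(0, 1): d(X) should be 1
discrete = rng.integers(0, 4, 1_000_000) / 4.0  # X uniform on {0, 1/4, 1/2, 3/4}: d(X) should be 0

for n in (10, 100, 1000):
    print(n,
          quantized_entropy(continuous, n) / np.log(n),   # -> 1
          quantized_entropy(discrete, n) / np.log(n))     # -> 0 (slowly)
```

The discrete ratio only decays like $H(X)/\log n$, so it approaches $0$ rather slowly, while for the continuous uniform the ratio is essentially $1$ already at moderate $n$.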
Rényi proved that if $X$ has an absolutely continuous distribution with density function $f$ and finite $H([X])$, then:
$$
d(X)=1,\qquad H_1(X)=h(X).
$$
This is what we discussed above for $\delta=\frac 1n$:
$$
H(X_n)\approx h(X)+\log n.
$$
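As a quick sanity check: if $X$ is uniform on $[0,1]$, then $[nX]$ is uniform on $\{0,1,\dots,n-1\}$ (up to the null event $X=1$), so $H(X_n)=\log n$ exactly, which is consistent with $h(X)=0$ for the uniform density on $[0,1]$.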
Kolmogorov instead introduced the notion of $\epsilon$-entropy, which is defined for random variables in abstract metric spaces and is therefore more general.
To answer your question: we can keep the same intuition as in the discrete case for differential entropy, at least when we use it to compute mutual information or KL divergence.
For the entropy itself, we have to alter our intuition a bit. The entropy of a discrete random variable is the minimum number of bits needed to compress it. But a random variable with uncountable support can always be "compressed" onto another uncountable set of the same cardinality (any one-to-one and onto mapping does that), even though different random variables with uncountable support can have different differential entropies.
Does anyone know if the Gumbel can occur as a limit distribution for such a sum?
When we have $n$ exponentially distributed variables $X_i \sim \text{Exp}(\gamma = i)$ (with expectation $1/i$ and variance $1/i^2$), the sum
$$S = \sum_{i=1}^n (X_i - 1/i)$$
approaches a Gumbel distribution.
There is a connection between this sum and the maximum order statistic.
We can view this sum as the waiting time until $n$ bins are all filled, when each bin is filled according to a Poisson process.
- Approach with the sum. The waiting time between successive bin fillings is exponentially distributed. While all $n$ bins are empty, the rate for the first filling is $n$; once one bin is filled, $n-1$ bins remain empty and the rate for the next filling is $n-1$, and so on...
- Approach with the maximum. We can consider the waiting times for filling each individual bin. The waiting time to fill all bins is equal to the maximum of the individual waiting times.
The distribution of the maximum of exponentially distributed variables approaches a Gumbel distribution. Therefore the expression in terms of a sum, which has the same distribution, will also approach the Gumbel distribution.
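Here is a small simulation sketch of this. The values of $n$ and the number of replicates are my own choices, and the parametrization of the limiting Gumbel as location $-\gamma$ and scale $1$ (with $\gamma$ the Euler–Mascheroni constant) is my own bookkeeping based on the maximum-of-exponentials argument.

```python
import numpy as np

# Simulate S = sum_i (X_i - 1/i), X_i ~ Exp(rate = i), for a large n, and compare
# its empirical quantiles with the Gumbel CDF exp(-exp(-(s + gamma)))
# (location -gamma, scale 1), the limit suggested by the maximum argument above.
rng = np.random.default_rng(0)
n, reps = 1000, 200_000
gamma = 0.5772156649015329            # Euler-Mascheroni constant

S = np.zeros(reps)
for i in range(1, n + 1):
    # X_i simulated as a standard exponential divided by i; subtract its mean 1/i
    S += (rng.standard_exponential(reps) - 1.0) / i

for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    s = np.quantile(S, q)
    print(q, np.exp(-np.exp(-(s + gamma))))   # should be close to q
```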
See also "Intuition about the coupon collector problem approaching a Gumbel distribution" on Cross Validated.
This is, of course, not a general phenomenon.
If we instead use $X_i \sim N(\mu = 1/i, \sigma^2 = 1/i^2)$, then the (properly scaled) sum will approach a normal distribution.
That is a trivial example, but there are many more cases that converge to a normal distribution. The relevant condition that needs to be fulfilled is the Lyapunov condition.
Best Answer
There is a book on the subject: "Information Theory and the Central Limit Theorem" by Oliver Johnson. The article by Anshelevich mentioned by Yemon considers the operator $T$ acting on probability densities, corresponding to passing from the law of a random variable $X$ to that of $(X+Y)/\sqrt{2}$, where $Y$ is an independent copy of $X$. The entropy is a Lyapunov function for this transformation, which is the simplest example of a renormalization group transformation. The $N(0,1)$ density is a fixed point, and it is easy to diagonalize the linearization of $T$ near this fixed point using Wick monomials, i.e., Hermite polynomials. The directions corresponding to the 0th, 1st, and 2nd moments are expanding (relevant operators) or neutral (marginal operators), while all others are contracting (irrelevant operators). Therefore, if one makes the necessary arrangements (renormalization conditions) to fix these moments (e.g., subtracting $N$ times the mean and dividing by $\sqrt{N}$), then one lies on the stable manifold of the Gaussian fixed point. See the textbook on probability theory by Koralov and Sinai for more details.

The generalization of the $T$ map to joint probability distributions of dependent variables, i.e., the renormalization group, is explained in the book "A Renormalization Group Analysis of the Hierarchical Model in Statistical Mechanics" by Collet and Eckmann. The issue with using this type of nonlinear transformation is that the diagonalization at a fixed point only gives information about the vicinity of that fixed point. To get results far away, having a Lyapunov function like the entropy is of great importance. This is an active area in physics, which investigates generalizations of Zamolodchikov's $c$-"theorem" in conformal field theory; see for instance this article for a recent review. Entanglement entropy seems to be the Lyapunov function in this setting.
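For what it's worth, the entropy-as-Lyapunov-function statement can be watched numerically. The sketch below (the grid resolution, the uniform starting density, and the Riemann-sum entropy are all my own illustrative choices) applies the map $T$ repeatedly to a unit-variance density and prints the differential entropy, which climbs toward $\frac12\log(2\pi e)$, the entropy of $N(0,1)$.

```python
import numpy as np

# Iterate the map T: law of X -> law of (X + Y)/sqrt(2), Y an i.i.d. copy of X,
# on a grid, and watch the differential entropy increase toward that of N(0,1).
L, m = 12.0, 4001
x = np.linspace(-L, L, m)
dx = x[1] - x[0]

# Starting density: uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1)
f = np.where(np.abs(x) <= np.sqrt(3), 1 / (2 * np.sqrt(3)), 0.0)

def entropy(f):
    """Differential entropy -integral f log f, via a Riemann sum (nats)."""
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dx

for step in range(8):
    print(step, entropy(f))
    conv = np.convolve(f, f, mode="full") * dx           # density of X + Y on a wider grid
    x_sum = np.linspace(2 * x[0], 2 * x[-1], conv.size)  # support of the convolution
    # Density of (X + Y)/sqrt(2): sqrt(2) * f_{X+Y}(sqrt(2) z), resampled onto the grid
    f = np.sqrt(2) * np.interp(np.sqrt(2) * x, x_sum, conv, left=0.0, right=0.0)
    f /= f.sum() * dx                                    # renormalize interpolation error

print("Gaussian limit:", 0.5 * np.log(2 * np.pi * np.e))
```

The monotone increase seen here is exactly the entropy power inequality at work: $h((X+Y)/\sqrt{2})\ge h(X)$ for i.i.d. $X,Y$, with equality only at the Gaussian fixed point.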