[Math] Partition-based entropy of a sequence

entropy, information theory, random, random variables, statistics

The entropy $H$ of a discrete random variable $X$ is defined by

$$H(X)=E[I(X)]=\sum_xP(x)I(x)=\sum_xP(x)\log P(x)^{-1}$$

where $x$ are the possible values of $X$, $P(x)$ is the probability of $x$, $E$ is the expected value operator, and $I$ is the self-information. In other words, $H(X)$ is the expected information content of $X$.
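For concreteness, here is a minimal Python sketch of that definition (the base-2 logarithm, giving entropy in bits, and the helper name shannon_entropy are my own choices; the question does not fix a base until the numerical examples below):

```
import math

def shannon_entropy(probs):
    """H(X) = sum_x P(x) * log2(1 / P(x)), in bits; zero-probability terms contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # a fair coin carries exactly 1 bit
print(shannon_entropy([0.9, 0.1]))  # a biased coin carries less (~0.469 bits)
```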

Suppose we have a sequence

$$s=\langle s_1,s_2,s_3,s_4,\ldots,s_n\rangle$$

Let $N(x)$ be the number of occurrences of $x$ in $s$. Estimating $P(x)$ by the empirical frequency of $x$ (equivalently, drawing a position of $s$ uniformly at random), we can express $P(x)$ as $N(x)/n$, where $n$ is the length of the sequence. Hence we can express the entropy as

$$\sum_x\frac{N(x)}{n}\log\frac{n}{N(x)}=\log n-\frac{1}{n}\sum_x N(x)\log N(x)$$

This allows us to calculate the entropy of the sequence based on the number of occurrences of individual symbols. For example, using the base-2 logarithm,

$$H(\langle a,b,a,b,a,b\rangle)=1$$

according to this measure. Switching the last two symbols yields the same entropy

$$H(\langle a,b,a,b,b,a\rangle)=1$$

since the number of occurrences of each symbol does not change. However, it seems intuitively true that the second sequence is more "random" than the first sequence and hence should, in some sense, have higher entropy.
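As a quick check in Python (base-2 logarithms assumed, as in the examples), the count-based formula above indeed gives the same value for both orderings:

```
import math
from collections import Counter

def symbol_entropy(seq):
    """log2(n) - (1/n) * sum_x N(x) * log2(N(x)), the count form derived above."""
    n = len(seq)
    return math.log2(n) - sum(c * math.log2(c) for c in Counter(seq).values()) / n

print(symbol_entropy("ababab"))  # 1.0
print(symbol_entropy("ababba"))  # 1.0 -- the symbol counts are identical
```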

Indeed, I found that taking the entropy of the overlapping adjacent pairs of symbols in each sequence yields

$$H(\langle ab,ba,ab,ba,ab\rangle)\approx 0.970951$$
$$H(\langle ab,ba,ab,bb,ba\rangle)\approx 1.52193$$

By this measure, the entropy of the second sequence is higher than that of the first sequence. We can do the same for triplets of symbols as well:

$$H(\langle aba,bab,aba,bab\rangle)=1$$
$$H(\langle aba,bab,abb,bba\rangle)=2$$

and so on. Thus measuring the entropy not only for individual symbols but also for subsequences of symbols appears to yield a better estimate of the "complexity" of the sequence. The same concept can be applied to 2D partitions (see here).
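A Python sketch of this sliding-window computation (the name block_entropy is mine; the question does not name the procedure) reproduces the pair and triplet values above:

```
import math
from collections import Counter

def block_entropy(seq, k):
    """Empirical entropy (bits) of the overlapping length-k subsequences of seq."""
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    nb = len(blocks)
    return -sum((c / nb) * math.log2(c / nb) for c in Counter(blocks).values())

print(block_entropy("ababab", 2))  # ~0.970951
print(block_entropy("ababba", 2))  # ~1.52193
print(block_entropy("ababab", 3))  # 1.0
print(block_entropy("ababba", 3))  # 2.0
```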

My questions are as follows:

  • What is the name of this approach in the literature? I'm thinking it might be something along the lines of partition-based entropy or substring-based entropy.
  • How can I rigorously combine the partition-based entropy estimates for varying lengths to obtain a single, overall estimate of the entropy of the sequence?
  • Is this approach related to Lempel-Ziv compression?
  • Can this approach be extended to continuous functions, perhaps by partitioning the domain of the function into open subsets?

Edit:

On further thought, it seems a semi-rigorous combined result might be achieved by using the minimum description length (MDL) principle and establishing a prior for the hypothesis "the alphabet of the source is sequences of symbols of length $n$" as follows:

$$P(\text{source produces words of length }n)=2^{-n}\text{ or }\frac{1}{n(n+1)}$$

either of which sums to one:

$$\sum_{n=1}^\infty P(\text{source produces words of length }n)=1$$

thus the $n$-partition entropy estimate for each $n$ is weighted by a function of $n$.
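As an illustration of this proposal (not an established method), one could weight each $n$-block estimate by the prior $2^{-n}$; the other prior $1/(n(n+1))$ would work the same way. The block_entropy helper is the sliding-window estimator sketched earlier, and truncating the sum at the sequence length means the weights no longer sum exactly to one:

```
import math
from collections import Counter

def block_entropy(seq, k):
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    nb = len(blocks)
    return -sum((c / nb) * math.log2(c / nb) for c in Counter(blocks).values())

def combined_entropy(seq, prior=lambda n: 2.0 ** -n):
    """Weight the n-block entropy estimate by a prior P(n) over block lengths n."""
    return sum(prior(n) * block_entropy(seq, n) for n in range(1, len(seq) + 1))

print(combined_entropy("ababab"))  # ~0.956
print(combined_entropy("ababba"))  # ~1.261 -- higher, matching the intuition above
```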

Edit: This appears to correspond to the serial test here and the overlapping permutations test here.

Best Answer

What is the name of this approach in the literature?

This is called block entropy; however, it has two different interpretations, corresponding to the following two different scenarios. In both scenarios we have a fixed string of symbols $y_1..y_n$, each from a finite alphabet $A$, and $H(X)=-\ E\ \log(p(X))$ denotes the Shannon entropy of a random object $X$ with probability mass function $p(x)$:

Scenario 1. The given string $y = y_1..y_n$ is just a particular value of a random string $Y = Y_1..Y_n$ with values in $A^n$, the unknown distribution of $Y$ being the real target of interest. If $N_Y(w)$ denotes the (random!) number of times the substring $w$ occurs in $Y$, then $\frac{1}{n-|w|+1}N_Y(w)$ has expectation $$\begin{align} E\left(\frac{1}{n-|w|+1}N_Y(w)\right) &= E\left( \frac{1}{n-|w|+1} \sum_{i=1}^{n-|w|+1} [Y_i..Y_{i+|w|-1}=w] \right)\\ &= \frac{1}{n-|w|+1} \sum_{i=1}^{n-|w|+1} P(Y_i..Y_{i+|w|-1}=w) \\ &= \frac{1}{n-|w|+1}\ (n-|w|+1)\ p(w) \\ &= p(w), \end{align}$$ assuming the $Y_i$ sequence is strictly stationary (i.e. the probability of subword $w$ in location $i$ does not depend on $i$), so $$\hat{p}(w) = \frac{1}{n-|w|+1}N_Y(w)$$ is an unbiased estimator of $p(w)$. The $k$-block entropy of the random string $Y$ is defined to be

$$\begin{align} H_k &= -\ E\ \log\ p(Y_1..Y_k)\\ &= - \sum_{w \in A^k}\ P(Y_1..Y_k = w)\ \log\ P(Y_1..Y_k = w) \\ &= - \sum_{w \in A^k}\ p(w)\ \log\ p(w) \end{align}$$

and a naive estimator for $H_k$ is obtained by simply replacing $p(w)$ by its unbiased estimator $\hat{p}(w)$:

$$\hat{H}_k = - \sum_{w \in A^k} \hat{p}(w)\ \log\ \hat{p}(w).$$

(It turns out, however, that $\hat{H}_k$ is strongly biased, tending to underestimate $H_k$, and adjustments to the formula are sometimes made to eliminate the bias.) The main point is that in this scenario, $\hat{H}_k$ is not seen as characterising the given string $y$ at all; rather, it is seen as a biased estimator of the $k$-block entropy of the random string $Y$.
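For illustration, here is a sketch of the naive estimator together with one common adjustment of the kind mentioned, the Miller-Madow correction, which adds $(m-1)/(2N\ln 2)$ bits, where $m$ is the number of distinct blocks seen and $N$ the number of overlapping blocks. Since overlapping blocks are not independent, the correction is only heuristic here, and the simulation merely illustrates the downward bias:

```
import math
import random
from collections import Counter

def h_k_hat(y, k, miller_madow=False):
    """Plug-in estimate of the k-block entropy (bits) from one observed string y.

    miller_madow=True adds the first-order correction (m - 1) / (2 * N * ln 2),
    where m is the number of distinct blocks seen and N the number of
    overlapping blocks; overlapping blocks are not independent, so the
    correction is only heuristic in this setting.
    """
    blocks = [y[i:i + k] for i in range(len(y) - k + 1)]
    N = len(blocks)
    counts = Counter(blocks)
    h = -sum((c / N) * math.log2(c / N) for c in counts.values())
    if miller_madow:
        h += (len(counts) - 1) / (2 * N * math.log(2))
    return h

# For an i.i.d. fair-coin source the true k-block entropy is exactly k bits;
# averaging the naive estimate over many sampled strings shows it falls short.
random.seed(1)
k, n, trials = 3, 50, 2000
naive = corrected = 0.0
for _ in range(trials):
    y = "".join(random.choice("01") for _ in range(n))
    naive += h_k_hat(y, k)
    corrected += h_k_hat(y, k, miller_madow=True)
print(naive / trials, corrected / trials, "vs true value", k)
```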

Some online "Scenario 1" articles re "block entropy":

Scenario 2. The source of the string $y$ (random or otherwise) is not relevant; rather, the target of interest is the "complexity" or "diversity" of the given string $y$, quantified by the $k$-block entropy $(k\in 0..n)$ of the actually occurring distribution of length-$k$ substrings present in $y$. (This equals the entropy of the probability distribution of a random length-$k$ substring $W$ drawn from $y$, with $P(W=w)= \frac{1}{n-k+1}N_y(w)$.)

The underlying general principle seems to be that the complexity/diversity of a given (nonrandom) object is measured by the entropies of various kinds of random samples from that object, i.e., the entropies of distributions that "actually occur" in the object. To me this seems eminently reasonable, but I find no online sources that explicitly state such a principle. (Such an entropy could be called "an entropy of the object", but that would be completely nonsensical in a "Scenario 1" context.)

Some "Scenario 2" links:

The NIST publication uses $ApEn(m) := H_{m+1}-H_m$ as one of the statistics to test PRNGs (in section $3.12$ "Approximate Entropy Test"), saying that

“$ApEn(m)$ measures the logarithmic frequency with which blocks of length $m$ that are close together remain close together for blocks augmented by one position. Thus, small values of $ApEn(m)$ imply strong regularity, or persistence, in a sequence. Alternatively, large values of $ApEn(m)$ imply substantial fluctuation, or irregularity.”
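As a rough illustration of that statistic (the actual NIST test differs in details such as wrapping the sequence around, so this only mirrors the $H_{m+1}-H_m$ form quoted above):

```
import math
from collections import Counter

def block_entropy(seq, k):
    """Empirical entropy (bits) of the overlapping length-k blocks of seq."""
    blocks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    nb = len(blocks)
    return -sum((c / nb) * math.log2(c / nb) for c in Counter(blocks).values())

def ap_en(seq, m):
    """ApEn(m) = H_{m+1} - H_m: small for regular sequences, larger for irregular ones."""
    return block_entropy(seq, m + 1) - block_entropy(seq, m)

print(ap_en("ababab", 1))  # ~ -0.03: pairs add almost nothing beyond single symbols
print(ap_en("ababba", 1))  # ~  0.52: pairs reveal extra irregularity
```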

How can I rigorously combine the partition-based entropy estimates for varying lengths to obtain a single, overall estimate of the entropy of the sequence?

In a "Scenario 2" context, each $k$-block entropy could be called an entropy of the string, and their arthmetic mean could be thought of as the average block entropy of the string. (This was done in the table displayed in the first posting cited above.) In a "Scenario 1" context, "entropy of the [nonrandom] string" makes no sense, of course; however, it seems a good question to ask how the $k$-block entropies might be combined -- I'm thinking some sort of weighted arithmetic mean might have desirable sampling properties.

Is this approach related to Lempel-Ziv compression?

In a "Scenario 1" context, there have been papers that relate block entropy to Kolmogorov complexity, and thence to compressibility; e.g., Section 6 of the paper by Lesne (linked above).
