What is the name of this approach in the literature?
This is called block entropy; however, it has two different interpretations, corresponding to the two scenarios below. In both scenarios we have a fixed string of symbols $y_1..y_n$, each from a finite alphabet $A$, and $H(X)=-\ E\ \log(p(X))$ denotes the Shannon entropy of a random object $X$ with probability mass function $p(x)$:
Scenario 1. The given string $y = y_1..y_n$ is just a particular value of a random string $Y = Y_1..Y_n$ with values in $A^n$, the unknown distribution of $Y$ being the real target of interest. If $N_Y(w)$ denotes the (random!) number of times the substring $w$ occurs in $Y$, then $\frac{1}{n-|w|+1}N_Y(w)$ has expectation $$\begin{align}
E\left(\frac{1}{n-|w|+1}N_Y(w)\right) &= E\left( \frac{1}{n-|w|+1} \sum_{i=1}^{n-|w|+1} [Y_i..Y_{i+|w|-1}=w] \right)\\
&= \frac{1}{n-|w|+1} \sum_{i=1}^{n-|w|+1} P(Y_i..Y_{i+|w|-1}=w) \\
&= \frac{1}{n-|w|+1}\ (n-|w|+1)\ p(w) \\
&= p(w),
\end{align}$$
assuming the $Y_i$ sequence is strictly stationary (which guarantees that the probability of the subword $w$ occurring at location $i$ does not depend on $i$), so
$$\hat{p}(w) = \frac{1}{n-|w|+1}N_Y(w)$$ is an unbiased estimator of $p(w)$. The $k$-block entropy of the random string $Y$ is defined to be
$$\begin{align}
H_k &= -\ E\ \log\ p(Y_1..Y_k)\\
&= - \sum_{w \in A^k}\ P(Y_1..Y_k = w)\ \log\ P(Y_1..Y_k = w) \\
&= - \sum_{w \in A^k}\ p(w)\ \log\ p(w)
\end{align}$$
and a naive estimator for $H_k$ is obtained by simply replacing $p(w)$ by its unbiased estimator $\hat{p}(w)$:
$$\hat{H}_k = - \sum_{w \in A^k} \hat{p}(w)\ \log\ \hat{p}(w).$$
(It turns out, however, that $\hat{H}_k$ is strongly biased, tending to underestimate $H_k$, and adjustments to the formula are sometimes made to reduce this bias.) The main point is that in this scenario, $\hat{H}_k$ is not seen as characterising the given string $y$ at all; rather, it is seen as a biased estimator of the $k$-block entropy of the random string $Y$.
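For concreteness, here is a minimal sketch (my own illustration, not taken from any of the sources cited below) of the plug-in estimator $\hat{H}_k$: count the overlapping length-$k$ substrings of $y$, form $\hat{p}(w)$, and sum.

```python
import math
from collections import Counter

def block_entropy(y, k, base=2):
    """Plug-in estimate of the k-block entropy of the string y: counts every
    overlapping length-k substring, forms the empirical distribution
    p_hat(w) = N_y(w) / (n - k + 1), and returns -sum p_hat(w) * log p_hat(w)."""
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(y)")
    total = n - k + 1
    counts = Counter(y[i:i + k] for i in range(total))
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

print(block_entropy("0110100110010110", 2))  # 2-block entropy, in bits
```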
Some online "Scenario 1" articles re "block entropy":
Scenario 2. The source of the string $y$ (random or otherwise) is not relevant; rather, the target of interest is the "complexity" or "diversity" of the given string $y$, quantified by the $k$-block entropy $(k\in 0..n)$ of the actually occurring distribution of length-$k$ substrings present in $y$. (This equals the entropy of the probability distribution of a random length-$k$ substring $W$ drawn from $y$, with $P(W=w)= \frac{1}{n-k+1}N_y(w)$.)
The underlying general principle seems to be that the complexity/diversity of a given (nonrandom) object is measured by the entropies of various kinds of random samples from that object, i.e., the entropies of distributions that "actually occur" in the object. To me this seems eminently reasonable, but I find no online sources that explicitly state such a principle. (Such an entropy could be called "an entropy of the object", but that would be completely nonsensical in a "Scenario 1" context.)
Some "Scenario 2" links:
- NIST SP 800-22, Revision 1a, "A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications"
- S. Pincus and B. H. Singer, “Randomness and degrees of irregularity,” Proc. Natl. Acad. Sci. USA, Vol. 93, March 1996, pp. 2083-2088
- S. Pincus and R. E. Kalman, “Not all (possibly) ‘random’ sequences are created equal,” Proc. Natl. Acad. Sci. USA, Vol. 94, April 1997, pp. 3513-3518
- mathSE posting
- statsSE posting
The NIST publication uses $ApEn(m) := H_{m+1}-H_m$ as one of the statistics to test PRNGs (in section $3.12$ "Approximate Entropy Test"), saying that
“$ApEn(m)$ measures the logarithmic frequency with which blocks of length $m$ that are close together remain close together for blocks augmented by one position. Thus, small values of $ApEn(m)$ imply strong regularity, or persistence, in a sequence. Alternatively, large values of $ApEn(m)$ imply substantial fluctuation, or irregularity.”
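As a hedged sketch of this statistic (reusing the `block_entropy` function from the sketch above, with natural logarithms), one can take the difference of successive plug-in block entropies. Note that the actual NIST test augments the sequence by wrapping bits from its start around to its end before counting blocks, which this simplified version omits; the function name `apen` is mine.

```python
import math

def apen(y, m):
    # ApEn(m) = H_{m+1} - H_m, with H_k the plug-in k-block entropy computed
    # by block_entropy() from the earlier sketch (natural logs here).
    return block_entropy(y, m + 1, base=math.e) - block_entropy(y, m, base=math.e)

print(apen("0110100110010110", 2))
```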
How can I rigorously combine the partition-based entropy estimates for varying lengths to obtain a single, overall estimate of the entropy of the sequence?
In a "Scenario 2" context, each $k$-block entropy could be called an entropy of the string, and their arthmetic mean could be thought of as the average block entropy of the string. (This was done in the table displayed in the first posting cited above.) In a "Scenario 1" context, "entropy of the [nonrandom] string" makes no sense, of course; however, it seems a good question to ask how the $k$-block entropies might be combined -- I'm thinking some sort of weighted arithmetic mean might have desirable sampling properties.
Is this approach related to Lempel-Ziv compression?
In a "Scenario 1" context, there have been papers that relate block entropy to Kolmogorov complexity, and thence to compressibility; e.g., Section 6 of the paper by Lesne (linked above).
Best Answer
You've misunderstood the probability space over which $p$ is defined. The column entropy of a particular index in an MSA is calculated over the distribution of symbols at that index in the MSA. In other words, let's represent your MSA by an $m \times n$ matrix $S$ where each row is a sequence $S_i$ in your MSA of length $n$. You can think of $S$ as defining $n$ probability distributions $p_j$ over the (single!) alphabet $\mathcal{X}$ of all 20 amino acid symbols, where
$$p_j(x) := \frac{|\{i \in [m] : S_{i,j} = x\}|}{m}$$
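To make that notation concrete, here is a minimal sketch (my own illustration, not the code from the repository you linked) that builds each column distribution $p_j$ and its Shannon entropy for an alignment given as a list of equal-length strings:

```python
import math
from collections import Counter

def column_entropies(msa, base=2):
    """msa: list of m equal-length strings, the rows S_1..S_m of the alignment.
    Returns the Shannon entropy of each column distribution p_j."""
    m = len(msa)
    entropies = []
    for j in range(len(msa[0])):
        counts = Counter(row[j] for row in msa)  # |{i : S[i][j] == x}| for each symbol x
        entropies.append(-sum((c / m) * math.log(c / m, base) for c in counts.values()))
    return entropies

print(column_entropies(["ACDA", "ACDG", "ATDG"]))  # toy 3-row alignment
```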
Now to compute the Shannon entropy of the $j$th column, $p_j$ would be the probability mass function $p$ that you use in the definition you provided and $\mathcal{X}$ would be the alphabet. Also, this is indeed what's going on in the github link you provided. Taken from a comment earlier on in the code: