Probability – Limit Distribution of Longest Runs in Random Binary Sequence

combinatorics, probability, probability-distributions

Let $X_n$ be the number of longest runs in a sequence of $n$ i.i.d. Bernoulli(1/2) random variables. Here, "run" means a nonempty block of adjacent $1$s; e.g., in the sequence $110110001$, there are two longest runs (each being $11$).
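
For concreteness, here is a minimal Python sketch of this count (the helper name num_longest_runs is mine, not taken from any cited source):

```python
def num_longest_runs(s: str) -> int:
    """Number of maximal runs of 1s that attain the maximal run length (0 if s contains no 1)."""
    runs = [len(block) for block in s.split("0") if block]   # lengths of the maximal runs of 1s
    return runs.count(max(runs)) if runs else 0

print(num_longest_runs("110110001"))   # -> 2 (the two runs "11")
```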

By using a program from OEIS, the limiting distribution of $X_n$, as $n\to\infty,$ is "empirically" suggested to be
$$p_k:=\lim_{n\to\infty}P[X_n=k]={1\over k\ 2^k\ \log(2)}\ \ (k=1,2,3,…)$$

Question: As this is probably a known result, can someone cite a published proof? (Or provide any information on how to prove it.)

"Derivation":
As described elsewhere, for $n\ge 1$, there is a bijection between (a) the compositions of $n$ with exactly $k$ largest parts, and (b) the binary strings of length $n-1$ with exactly $k$ longest runs of $1$s; in particular, (a) and (b) are equal in number. This number is listed as $T(n,k)$ at OEIS A238341, where a program for $T(n,k)$ is also given. (I've posted a separate question asking for explanation/sources of the algorithm implemented by this program.)

NB: Letting $A(n,k)$ be the number of length-$n$ binary strings with $k$ longest runs, and $T(n,k)$ be the number of compositions of $n$ with exactly $k$ largest parts, there is an "edge effect" due to which these are related as follows:

$$A(n,k)=\begin{cases}
1&\text{if $k=0$}\\[1ex]
T(n+1,k) &\text{if $1\le k\le n$}
\end{cases}$$

and it is to be noted that in fact $A(n,k)=0$ if $k\gt\lceil{n\over 2}\rceil$. (The relation stops at $k=n$ because $T(n+1,n+1)=1$, counting the all-ones composition of $n+1$, whereas no length-$n$ string has $n+1$ runs.)

Thus, we have
$$P[X_n=k]={A(n,k)\over 2^n} \ \ (k=0,1,2,3,…)$$
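
As a brute-force sanity check of this relation (a sketch of my own, not the OEIS program; the function names and the choice $n=12$ are arbitrary), one can enumerate both sides directly for small $n$:

```python
from itertools import product
from collections import Counter

def A(n):
    """A(n, k) for all k: count length-n binary strings by their number of longest runs of 1s."""
    def k_of(bits):
        runs = [len(b) for b in "".join(map(str, bits)).split("0") if b]
        return runs.count(max(runs)) if runs else 0
    return Counter(k_of(s) for s in product((0, 1), repeat=n))

def T(m):
    """T(m, k) for all k: count compositions of m by the multiplicity of their largest part."""
    counts = Counter()
    for cuts in product((0, 1), repeat=m - 1):   # cut the interval [0, m] after position i iff cuts[i-1] == 1
        parts, last = [], 0
        for i, c in enumerate(cuts, start=1):
            if c:
                parts.append(i - last)
                last = i
        parts.append(m - last)
        counts[parts.count(max(parts))] += 1
    return counts

n = 12
an, tk = A(n), T(n + 1)
assert an[0] == 1 and all(an[k] == tk[k] for k in range(1, n + 1))   # edge effect only at k = 0 (and k = n+1)
print({k: an[k] / 2**n for k in sorted(an)})                         # P[X_n = k] for this n
```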

Using the OEIS program gives $P[X_n=1]\approx 0.7213$ for $n\gt 100,$ which led me to guess that $p_1={1\over 2 \log 2}$. I then let $c_k:= p_k\,\log 2$, so that (assuming the $p_k$ sum to $1$) $\sum_{k=1}^\infty c_k=\log 2$; i.e., a series converging to $\log 2.$ One of the best-known such series is $c_k={1\over k\, 2^k}$, and this guess is strongly supported by computing $\bigl(P[X_n=k]\,\log 2\bigr)_{k=1,2,\dots}\approx(1/2,\,1/8,\,1/24,\,1/64,\,1/160,\dots)=\bigl({1\over k\,2^k}\bigr)_{k=1,2,\dots}$ for $n\gt 100.$
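
To reach values of $n$ beyond what brute-force enumeration allows, here is a dynamic-programming sketch of my own (it is not the OEIS program); the state is (length of the trailing run of $1$s, current record run length, number of runs attaining the record):

```python
from math import log

def longest_run_distribution(n):
    """Return {k: P[X_n = k]} via a DP whose state is
    (length of the trailing run of 1s, current record length m, number of runs attaining m)."""
    dp = {(0, 0, 0): 1}                      # the empty string
    for _ in range(n):
        nxt = {}
        for (r, m, c), ways in dp.items():
            # append a 0: the trailing run of 1s ends, the record is unchanged
            key = (0, m, c)
            nxt[key] = nxt.get(key, 0) + ways
            # append a 1: the trailing run grows to r + 1
            if r + 1 > m:
                key = (r + 1, r + 1, 1)      # new record, attained only by this run
            elif r + 1 == m:
                key = (r + 1, m, c + 1)      # this run now ties the record
            else:
                key = (r + 1, m, c)          # still shorter than the record
            nxt[key] = nxt.get(key, 0) + ways
        dp = nxt
    probs = {}
    for (_, _, c), ways in dp.items():
        probs[c] = probs.get(c, 0) + ways / 2**n
    return probs

p = longest_run_distribution(120)                           # well past n = 100
print(round(p[1], 4))                                       # ~ 0.7213
print({k: round(p[k] * log(2), 5) for k in range(1, 6)})    # compare with 1/(k*2^k) = 0.5, 0.125, 0.04167, ...
```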


A Python/SageMath rendition of the OEIS program is online at SageMathCell. It takes about a minute there to plot $(P[X_n=k])_{n=1..45}$ for $k=0,1,2,3$.


Aside:
(Sorry for the length, but I thought someone might be interested …)

@kodlu's comment led me to show that the limiting distribution of the number of longest runs in a random binary sequence of length $n\to\infty$ is the same for each of the following definitions of "run":

  • D1: "run" means a nonempty block of adjacent $1$s.
  • D2: "run" means a nonempty block of adjacent $1$s, or a nonempty block of adjacent $0$s.
  • D3: "run" means a nonempty block of adjacent $1$s, flanked on each side either by a $0$ or by the start/end of the sequence.
  • D4: "run" means a nonempty block of adjacent $1$s, flanked on each side either by a $0$ or by the start/end of the sequence, or a nonempty block of adjacent $0$s, flanked on each side either by a $1$ or by the start/end of the sequence.

Proof:

Clearly, D1 and D3 give the same distribution for all $n$ because the longest run is the same by either definition (hence the limit distributions are the same). Similarly, D2 and D4 give the same distribution for all $n$ (though different from those for D1 and D3).

Therefore, all that remains to show is that the limit distribution for D1 is the same as that for D2. To do so, let $A(n,k)$ (resp. $B(n,k)$) denote the number of length-$n$ binary strings with exactly $k$ longest runs, according to definition D1 (resp. D2), and let $X_n(s)$ (resp. $Y_n(s)$) denote the number of longest runs in sequence $s$, according to definition D1 (D2, resp.). We will show that $$B(n,k)=2\ A(n-1,k) \ \ (0\lt k\lt n)$$
from which it follows that for $0\lt k\lt n$,
$$P[Y_n=k] = {B(n,k)\over 2^n}={A(n-1,k)\over 2^{n-1}}=P[X_{n-1}=k]
$$

and hence
$$\lim_{n\to\infty}P[Y_n=k]=\lim_{n\to\infty}P[X_n=k]\ \ (k=1,2,3,\dots).$$

(We can dispense with $k=0$, because $P[Y_n=0]=0$ and $P[X_n=0]=2^{-n}\to 0.$)

Thus proceeding, with $s[1]$ denoting the first element of $s$, we have, for $0\lt k\lt n$:
$$\begin{align}B(n,k)
&\overset{1}{=}|\{s\in\{0,1\}^n:\ Y_n(s)=k\}|\\
&\overset{2}{=}2\ |\{s\in\{0,1\}^n:\ Y_n(s)=k,\ s[1]=0\}|\\
&\overset{3}{=}2\ |\{s\in\{0,1\}^{n-1}:\ X_{n-1}(s)=k\}|\\
&\overset{4}{=}2\ A(n-1,k)
\end{align}$$

Equality 2 holds because pairing each $s$ with its complement $\bar s$ (all bits flipped) gives a bijection between the sequences beginning with $0$ and those beginning with $1$, and $Y_n(s)=Y_n(\bar s)$: any run in $s$ exactly matches a run of the complementary symbol in $\bar s$.

To establish equality 3, denote the corresponding sets (suppressing $k$) as
$$T_n:=\{s\in\{0,1\}^n:\ Y_n(s)=k,\ s[1]=0\}\\[1ex] S_{n-1}:=\{s\in\{0,1\}^{n-1}:\ X_{n-1}(s)=k\}$$

Now, any element of $T_n$ can be written, with unique positive integers $(n_1,…,n_j)$ summing to $n$, in the usual "run-length encoding" form
$$t = \ 0^{n_1}1^{n_2}0^{n_3}1^{n_4}\dots b^{n_j}, \ \ b=(j+1)\bmod 2$$

and any element of $S_{n-1}$ can be written, with unique positive integers $(n_1,…,n_j)$ summing to $n$, in the special form
$$s = \ 1^{n_1-1}01^{n_2-1}01^{n_3-1}\dots01^{n_j-1}$$

thus setting up the bijections

$$\underbrace{0^{n_1}1^{n_2}0^{n_3}1^{n_4}\dots b^{n_j}}_{\in T_n}\leftrightarrow(n_1,\dots,n_j)\leftrightarrow\underbrace{1^{n_1-1}01^{n_2-1}01^{n_3-1}\dots01^{n_j-1}}_{\in S_{n-1}}$$
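
For example, with $n=6$ and $(n_1,n_2,n_3,n_4)=(2,2,1,1)$, this pairs $001101\in T_6$ with $10100\in S_5$, and indeed $Y_6(001101)=2=X_5(10100)$, the multiplicity of $\max(2,2,1,1)$.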

Now, $Y_n(0^{n_1}1^{n_2}0^{n_3}1^{n_4}\dots b^{n_j})=X_{n-1}(1^{n_1-1}01^{n_2-1}01^{n_3-1}\dots01^{n_j-1})=$ the multiplicity of $\max(n_1,…,n_j)$, the sole exception being the pair with $n_1=\dots=n_j=1$, for which $Y_n(0101\dots)=n$ while $X_{n-1}(00\dots0)=0$. Since we assume $0\lt k\lt n$, this exceptional pair belongs to neither $T_n$ nor $S_{n-1}$, so the correspondence restricts to a bijection between $T_n$ and $S_{n-1}$.

Consequently, $|T_n| = |S_{n-1}|$ for $0\lt k\lt n$.
QED
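
For what it's worth, the identity $B(n,k)=2\,A(n-1,k)$ for $0\lt k\lt n$ can also be checked by brute force; a sketch of my own (helper names and the choice $n=12$ are arbitrary):

```python
from itertools import product
from collections import Counter

def count_longest(bits, symbols):
    """Number of maximal runs over the given symbol set that attain the maximal run length."""
    runs, i = [], 0
    while i < len(bits):
        if bits[i] in symbols:
            j = i
            while j < len(bits) and bits[j] == bits[i]:
                j += 1
            runs.append(j - i)                 # length of this maximal run
            i = j
        else:
            i += 1
    return runs.count(max(runs)) if runs else 0

n = 12
A = Counter(count_longest(s, {1}) for s in product((0, 1), repeat=n - 1))   # definition D1, length n-1
B = Counter(count_longest(s, {0, 1}) for s in product((0, 1), repeat=n))    # definition D2, length n
assert all(B[k] == 2 * A[k] for k in range(1, n))                           # B(n,k) = 2*A(n-1,k) for 0 < k < n
```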

Best Answer

Not really an answer, rather a reference.

To be clear, we are considering the longest runs of ones among $n$ fair coin flips. In this approach, the run lengths can be treated as i.i.d. geometric ($p=1/2$) variables. Since each has mean $2$, there are roughly $n/2$ runs in total, of which about half are runs of ones.

Then we can consider $t=n/4$ i.i.d. geometric variables $Y_j \in \{1,2,\dots\}$ with $p=1/2$.

Letting $m=\max_{1 \le j \le t}(Y_j)$ be its maximum, we are interested in the distribution of the random variable $Z= \sum_{j} [Y_j = m]$ as $t\to \infty$.

(BTW: this formulation implies that, if the distribution exists, it does not matter if we consider the runs of ones, zeros, or both - and that the border effects are negligible).
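
For illustration, here is a small Monte Carlo sketch of this formulation (my own code, using NumPy's geometric sampler; the sample sizes are arbitrary), comparing the empirical distribution of $Z$ with the conjectured $p_k$:

```python
import numpy as np
from collections import Counter
from math import log

rng = np.random.default_rng(0)
t, trials = 10_000, 10_000                    # geometric variables per trial, and number of trials
counts = Counter()
for _ in range(trials):
    y = rng.geometric(0.5, size=t)            # i.i.d. geometric(1/2) variables on {1, 2, ...}
    counts[int(np.sum(y == y.max()))] += 1    # Z = number of Y_j attaining the maximum

for k in range(1, 6):
    print(k, counts[k] / trials, 1 / (k * 2**k * log(2)))   # empirical P(Z=k) vs conjectured p_k
```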

This is quite a tough problem. The reference paper seems to be Brands, J. J. A. M.; Steutel, F. W.; Wilms, R. J. G., On the number of maxima in a discrete sample, Stat. Probab. Lett. 20, No. 3, 209-217 (1994). ZBL0802.60048. You already found a non-paywalled preprint from 1992 here.

There it is proved that, in our case, $Z$ has no limit distribution: there is an oscillation that does not fully vanish. Numerically, however, the oscillation is small, so the distribution "almost" converges to a logarithmic distribution:

$$ P(Z=k) \approx \frac{1}{k \, 2^k \log 2}$$

as you conjectured. This corresponds to formula $3.6$ in the original paper, $3.3$ in the 1992 preprint.
