Does the fraction of distinct substrings in prefixes of the Thue–Morse sequence of length $2^n$ tend to $73/96$

combinatorics-on-wordsconjecturesdiscrete mathematicsexperimental-mathematicssequences-and-series

Recall that the Thue–Morse sequence$^{[1]}$$\!^{[2]}$$\!^{[3]}$ is an infinite binary sequence that begins with $\,t_0 = 0,$ and whose each prefix $p_n$ of length $2^n$ is immediately followed by its bitwise complement (i.e. obtained by flipping $0\to1$ and $1\to0$):
$$
\begin{array}{c|cc}&t_0&t_1&t_2&t_3&t_4&t_5&t_6&t_7&\!\!\!\dots\\\hline
p_0&0\\
p_1&0&\color{red}1\\
p_2&0&1&\color{red}1&\color{red}0\\
p_3&0&1&1&0&\color{red}1&\color{red}0&\color{red}0&\color{red}1\\
\cdots&\cdots\!\!
\end{array}
$$

We are interested in contiguous substrings of these prefixes. For a string $\mathcal{S}$ of length $\ell$, the total number of its substrings, including the empty substring $\langle\unicode{x202f}\rangle$ and the string $\mathcal{S}$ itself, is $(\ell^2+\ell+2)/2.$ Hence, the total number of substrings in $p_n$ is $(4^n+2^n+2)/2.$ Clearly, not all of those substrings are distinct for $n>1$. For example, $p_2 = \langle0\,1\,1\,0\rangle$ has $11$ substrings in total, but only $9$ distinct substrings:
$$
\begin{array}{l|cc}&\langle\!\!\!&0&\color{#808080}1&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle\\\hline
1&\langle\!\!\!&&&&&\!\!\!\rangle\\\hdashline
2&\langle\!\!\!&0&&&&\!\!\!\rangle\\
&\langle\!\!\!&&&&\color{#c8c8c8}0&\!\!\!\rangle\\\hdashline
3&\langle\!\!\!&&\color{#808080}1&&&\!\!\!\rangle\\
&\langle\!\!\!&&&\color{#b8b8b8}1&&\!\!\!\rangle\\\hdashline
4&\langle\!\!\!&0&\color{#808080}1&&&\!\!\!\rangle\\
5&\langle\!\!\!&&\color{#808080}1&\color{#b8b8b8}1&&\!\!\!\rangle\\
6&\langle\!\!\!&&&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle\\
7&\langle\!\!\!&0&\color{#808080}1&\color{#b8b8b8}1&&\!\!\!\rangle\\
8&\langle\!\!\!&&\color{#808080}1&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle\\
9&\langle\!\!\!&0&\color{#808080}1&\color{#b8b8b8}1&\color{#c8c8c8}0&\!\!\!\rangle
\end{array}
$$

Among these, $\langle0\rangle$ and $\langle1\rangle$ appear in $p_2$ twice, so the fraction of distinct substrings in $p_2$ is $\,\stackrel9{}\!\!\unicode{x2215}_{\!\unicode{x202f}11}\!.$

Can we find a simple general formula for $\mathscr D_n$, the number of distinct substrings in $p_n$? Let's try to compute a few terms:
$$2,\,4,\,9,\,28,\,101,\,393,\,1561,\,6233,\,24921,\,99673,\,398681,\,1594713,\,6378841,\,\dots$$
These few terms can be computed by a brute-force approach, but using Coolwater's program from here we can compute hundreds of thousands more. It is not too difficult to discover that for $n>2$ all known terms match a simple formula:
$$\mathscr D_n\stackrel{\color{#d0d0d0}?}=\frac{73\cdot4^n+704}{192}\color{#d0d0d0}{,\,\,\text{for}\,\,n>2}\tag{$\diamond$}$$
Somewhat oddly, the three initial terms $\mathscr D_0=2,\,\mathscr D_1=4,$ and $\mathscr D_2=9$ do not match the general formula $(\diamond)$, which results in non-integer rational values for these indexes. I conjecture that the general formula $(\diamond)$ is valid for all $n>2$.

$$\bbox[LemonChiffon]{\begin{array}{c}
\\
\hspace{1in}\text{Could you suggest a way to prove this conjecture?}\hspace{1in}\\
\vphantom.
\end{array}}$$

If the conjecture turns out to be true, then we have a remarkable corollary that for $n\to\infty$ the fraction of distinct substrings in the prefixes $p_n$ tends to a quite surprising limit:

$$\mathscr L=\lim_{n\to\infty}\frac{73\cdot4^n+704}{192}{\large/}\frac{4^n+2^n+2}2=\frac{73}{96}.\tag{$\small\spadesuit$}$$

Best Answer

It should be easy to derive the conjecture from the results of [1]. In particular, Brlek gives in Proposition 4.2 the precise value of the number $P(n,m)$ of factors of length $m$ of $p_n$ (up to the empty word, which is not included). But more interestingly, he gives a table of the small values of $P_n(m)$. Here is this table (I added the empty word in the first column): \begin{array}{c|cc} n \backslash m & 0& 1 & 2 & 3 & 4 & 5 &6 & 7 & 8 & 9 & 10 & 11 & 12 &13 &14 &15 &16 &17 &18 &19 &20 &21 \\ \hline 1&1&2&1\\ 2&1&2&\mathbf{3}&2&1\\ 3&1&2&4&\mathbf{6}&5&4&3&2&1\\ 4&1&2&4&6&10&\mathbf{12}&11&10&9&8&7&6&5&4&3&2&1\\ 5&1&2&4&6&10&12&16&20&22&\mathbf{24}&23&22&21&20&19&18&17&16&15&14&13&12& \dotsm\\ 6&1&2&4&6&10&12&16&20&22&24&28&32&36&40&42&44&46&\mathbf{48}&47&46&45&44& \dotsm \end{array}

As you can see, there are two types of coefficients in this table. Starting from the coefficients in bold, in position $(k, 2^{k-2} + 1)$ for $k > 0$ (that is $\mathbf{6}$, $\mathbf{12}$, $\mathbf{24}$, $\mathbf{48}$, etc.) the coefficients decrease by $1$ in each line. Thus it is easy to take the sum of these coefficients.

The other coefficients, apart from the first values of $m$, also follow a regular pattern. One has $P(n,m) = P(n-1,m)$ for $m \leqslant 2^{n-3}$. Then the coefficients between $P(n, 2^{n-3} + 1)$ and $P(n, 2^{n-3} + 2^{n-4} + 1)$ form an arithmetic progression of reason $4$ (see $24, 28, 32, 36, 40$ in line 6) and then the coefficients between $P(n, 2^{n-3} + 2^{n-4} + 1)$ and $P(n, 2^{n-2} + 1)$ form an arithmetic progression of reason $2$ (see $40,42,44,46,48$ in line 6).

I am a bit lazy to make the complete computation but, with these observations in hand, it should not be too difficult to sum up the coefficients in each line to get the value of ${\cal D}_n$.

[1] S. Brlek, Enumeration of factors in the Thue-Morse word, Discrete Applied Math. 24 (1989), 83-96.

Related Question