[Math] Finding a specific sequence of digits in pi

entropypiprobabilitysequences-and-series

Looking at the pifs project on GitHub and this question on SO has made me curious as to how feasible it is to find a specific sequence of digits within Pi.

Essentially, on average, how many digits of pi would you have to have to go through to find a specific sequence of a given length?

Assuming the digits of pi to be normal and otherwise random, I would assume that the difficulty of finding a specific sequence increases exponentially(?) as the length of the sequence increases. For instance, just trying to find a short sequence like '12345' is fairly easy, occurring at position 49702. However, just one more digit, say, '123456', brings you up to position 2458885. Of course, it differs by specific sequence; '314159' is much easier to find in terms of the work you have to do (especially computationally, where runtimes and memory usage make a difference). There's probably still a way to find an average digit of pi on which any given sequence would start, but it's beyond the scope of my mathematical ability.

Another interesting question is whether there's a specific formula for average position in pi for a sequence of a given length. For instance, if you took a specific 256 kb sequence, it would take an absurd number of digits of pi to find it, but how would that absurd number compare the the far more absurd number of digits you would need to find a specific 512 kb sequence in pi?

I suppose one could also take information theory into account, because since pi is effectively random, at first glance it seems like it might take longer to identify a sequence that has a repeating pattern. Thinking more about it, I don't think that it would be any more difficult to find than any other specific sequence, but I'm not certain.

As you can probably tell by the horribly imprecise language above and my complete lack of reputation on this site, I'm not that familiar with these types of math, so if the tags below are inappropriate, please help me out by editing them.

Best Answer

Essentially, on average, how many digits of pi would you have to have to go through to find a specific sequence of a given length?

If the statistical behavior of the digits of $\pi$ is modeled by an i.i.d. sequence of random variables uniform on $\{0,1,2,3,4,5,6,7,8,9\}$, the following theorem is relevant:

If $(X_1, X_2, X_3, \dots)$ is a sequence of random variables that are i.i.d. uniform on a set $S$ of $m$ elements, then the expected first-occurrence time of a contiguous subsequence $\tau\in S^n$ is given recursively by
$$ET_\tau = m^n + ET_\sigma $$ where $\sigma$ is the longest proper suffix of $\tau$ that is also a prefix of $\tau$. Here $T_\tau$ is defined as the least $k$ such that $\tau$ is a contiguous subsequence of $(X_1, \dots , X_k)$, and by convention $T_{\text{empty sequence}}=0$.

(A proof of this is in the textbook by Sheldon M. Ross, "Introduction to Probability Models", 10th Edition, Example 4.26, pp 225-228. Ross proves a much more general result for Markov chains.)

E.g., with $m = 10$, $$\begin{align} ET_{01201201} & = 10^8 + ET_{01201}\\ & = 10^8 + 10^5 + ET_{01}\\ & = 10^8 + 10^5 + 10^2 + ET_{\text{empty sequence}}\\ & = 10^8 + 10^5 + 10^2 \end{align} $$ whereas the first-occurrence time of this sequence in the digits of $\pi$ is $91,997,854 \approx 0.9 ET_{01201201}$.

Some observations:

  • A bounding case is when $\tau$ is composed of $n$ copies of a single symbol, e.g. $0^n = 000...0$, so $$\begin{align} ET_{0^n} & = m^n + ET_{0^{n-1}}\\ & = m^n + m^{n-1} + ET_{0^{n-2}}\\ & ...\\ & = m^n + m^{n-1} + ... + 1\\ & = \frac{m^{n+1} - m}{m-1} \end{align} $$
    Hence, for all $\tau$, $$m^n \le ET_\tau \le \frac{m^{n+1} - m}{m-1} \lt m^{n+1} $$ so the growth is exponential in subsequence-length $n$.

  • Interestingly, $ET_\tau$ is always an integer.