[Math] the probability of a specific sequence of 11 digits occurring in a random sequence of one billion digits

piprobabilityrandom

This isn't homework, I'm actually (please don't ask me why) wondering how likely it is that any particular 11-digit telephone number will occur in the first billion digits of pi. My probability course was way too long ago, and the idea of creating a monte carlo simulation to figure this out seems a little extreme!

(And I realize that pi is not a random number, I'm just assuming that the digits are sequenced in a randomlike way.)

Best Answer

The problem is actually fairly complicated, and I would refer you to the article "A unified approach to word occurrence probabilities" (link) for an exhaustive analysis. But we can approximate the result more simply.

Note that the expected number of occurrences (counting overlaps) is trivial to obtain: there are $N-n+1$ possible starting positions, and the probability of an occurrence at each such position is $10^{-n}$, so the expected number of occurrences is exactly $(N-n+1)10^{-n}$. For $N=10^9$ and $n=11$, this is $10^{-2} - 10^{-10} \approx 0.01$. So we know that $$ \sum_{i=0}^{\infty}i\cdot p_i = p_1 + 2p_2 + \cdots=p_{\ge 1}+p_{\ge 2}+\cdots=(N-n+1)10^{-n}, $$ where $p_i$ is the probability of exactly $i$ occurrences and $p_{\ge i}$ is the probability of at least $i$ occurrences. In particular, we have this exact result: $$ p_0=1-p_{\ge 1}=1-(N-n+1)10^{-n}+p_{\ge 2}+p_{\ge 3}+\cdots. $$ We see that the likelihood of there being no occurrences increases, perhaps surprisingly, along with the likelihood of there being multiple occurrences. Now, in your case $N\gg n$, so we can largely ignore the existence of one occurrence when looking for another: any pairs are almost certain to be far apart, so the probability of any pairs is about $\frac{1}{2}(0.01)^2$, giving $$ p_0 \approx 1 - 0.01 + \frac{1}{2}(0.01)^2=0.99005. $$ However, there are two exceptions arising from self-overlapping phone numbers. If the phone number consists of a single digit (1-111-111-1111), then given one occurrence, an immediate recurrence has probability $10^{-1}$. This makes $p_{\ge 2}\approx 10^{-3}$ instead (twenty times larger than usual), hence $p_0\approx 0.991$. Similarly, if the phone number consists of a repeating pair of digits (1-212-121-2121), then an immediate recurrence has probability $10^{-2}$. This combines with the (comparable) probability of an independent second occurrence to give $p_{\ge 2}\approx 1.5\times 10^{-4}$ (three times larger than usual), hence $p_0\approx 0.99015$. (Phone numbers that self-overlap at longer intervals, like 1-213-121-3121, are also slightly less likely to occur, but there the effect is small enough to be ignored.)

Related Question