[Math] Probability of a substring occurring in a string

combinatoricsmarkov chainsprobability

Consider a random string of length $n<\infty$ where each digit can be between 0-9 with equal probability and a substring of length $k<n$ consisting of only zeros.

What is the probability of observing at least two occurrences of the substring?

Assumption: The same digit can be used in more than one substring. For example: Let the substring we are looking for be "$0000$". We observe somewhere in the string:

$\ldots59800000723\ldots$

and we will count this as two occurrences as

$\ldots 598\color{red}{0000}0723\ldots$ and $\ldots 5980\color{red}{0000}723\ldots$


My attempts at a solution: With $k=4$:

1) Instead of looking at it directly I have tried finding
\begin{align}
P(\text{"}0000\text{"} \text{ occurring }\geq 2) &= 1-P(\text{"}0000\text{"} \text{ occurring }<2)\\
&=1- P( \text{"}0000\text{"} \text{ not occuring}, \text{"}0000\text{"} \text{ occuring once})
\end{align}
but as the occurrences overlap I'm not sure how to continue.

2)
Consider a Markov chain with state probabilities
\begin{align}
p_{i,0} &= 0.9, \quad i=0,1,2,\ldots \\
p_{i,i+1} &= 0.1, \quad i= 1,2,\ldots
\end{align}
This is then the counting of the number of consecutive zeros. Finding the expected number of visits in state 4 and up I thought could lead me somewhere but a visit in state 5 implies a visit in state 4 so it would … add one occurrence and not two.

Best Answer

This sounds like a job for the Goulden-Jackson cluster method. Doron Zeilberger and John Noonan have a gem of a paper (and Maple programs) about it. In short, this method takes a collection of "bad" words over a finite alphabet and efficiently produces a two-variable generating function for words of various lengths containing specified numbers of bad words. In other words, if $s$ is the variable representing the lengths of strings and $t$ is the variable representing the numbers of bad words and $f$ is the generating function, then the coefficient of $s^mt^n$ in $f$ gives the number of strings of length $m$ that have exactly $n$ bad words. These generating functions always turn out to be rational.

In your example, you want to calculate the probability of selecting a string of length $n$ from the alphabet with ten letters $\{0,\dots,9\}$ containing at least two occurrences of $0000$. In the method, this corresponds to the ten-letter alphabet with a single bad word $0000$. Zeilberger and Noonan's Maple program gives the corresponding generating function $$f(s,t)=-{\frac {{s}^{3}t-{s}^{3}+{s}^{2}t-{s}^{2}+st-s-1}{9\,t{s}^{4}-9\,{s}^ {4}+9\,{s}^{3}t-9\,{s}^{3}+9\,{s}^{2}t-9\,{s}^{2}-st-9\,s+1}} $$ To compute the probability of selecting such a string, you would need to compute the number of strings of length $n$ with absolutely no bad words, i.e. $[s^nt^0]f(s,t)=[s^n]f(s,0)$, and the number of strings of length $n$ with exactly one bad word, i.e. $[s^nt^1]f(s,t)=[s^n]\partial_t f(s,t)_{|t=0}$. Granted, computing these coefficients will take some work, but I think the problem is very do-able from this stage.

Related Question