Probability of a success run of a given length

probability

I am looking for a simple explanation of the following derivation.

Consider a Bernoulli process consisting of $n$ trials. The probability of success (T) is $p$. The probability of failure (F) is $q=1-p$. Then the probability of having the longest success run of length $m$ ($\ldots \text{F} \underbrace{\text{TT}\ldots \text{T}}_m \text{F}\ldots$) under the condition of having $n_F$ failures is equal to
$$
P(L_n = m) = (1-p^{m+1})^{n_F}-(1-p^{m})^{n_F}.
$$

This statement is made in Mark F. Schilling, "The Surprising Predictability of Long Runs," Mathematics Magazine 85, 141 (2012), on page 147.
The author treats $n_F$ as an independent (given) value and sets it to its expected value $nq$ in order to derive
$$
P(L_n = m) \approx p^{p/q}-p^{1/q}.
$$

In my opinion these events are not independent: having $n_F$ failures and having at least one success run of length $m$ are correlated, and therefore the starting equation is false. For instance, if I set $n_F = 0$ and $m<n$, the probability should be 0, which contradicts the first formula (the whole sequence of length $n$ is $\underbrace{\text{TT}\ldots\text{T}}_n$, whereas we are looking for a success run of length $m<n$). Analogously, if $n_F=n$, the whole sequence is $\underbrace{\text{FF}\ldots\text{F}}_n$, so the probability $P(L_n = m)$ should again be zero, which also contradicts the formula.
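To see how far off the formula is for small $n$, one can compare it against an exact enumeration of all $2^n$ sequences. The following sketch (parameters $p$, $n$, $m$ are my own choice, not from the paper) does this for $n=10$ fair trials:

```python
from itertools import product

def longest_run(seq):
    """Length of the longest run of 1s (successes) in seq."""
    best = cur = 0
    for x in seq:
        cur = cur + 1 if x else 0
        best = max(best, cur)
    return best

# Small case where exact enumeration over all 2^n sequences is feasible.
p, n, m = 0.5, 10, 3
q = 1 - p

# Exact P(L_n = m): sum the probability of every sequence whose longest run is m.
exact = sum(
    p ** sum(s) * q ** (n - sum(s))
    for s in product((0, 1), repeat=n)
    if longest_run(s) == m
)

# The paper's formula with n_F fixed at its expected value n*q.
n_F = n * q
approx = (1 - p ** (m + 1)) ** n_F - (1 - p ** m) ** n_F

print(f"exact = {exact:.4f}, formula = {approx:.4f}")
# exact ≈ 0.2627 (= 269/1024), formula ≈ 0.2113
```

For these parameters the two disagree noticeably, which is consistent with the objection above: $n=10$ is far from the large-$n$ regime in which the formula is meant to be used.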

But let us assume for a moment that $n_F$ is very close to the true number of failures. The formula implies that the whole sequence looks like
$$
\underbrace{FTT\ldots T}_\text{subseq. 1}\underbrace{F TT\ldots T}_\text{subseq. 2}\ldots \underbrace{F TT\ldots T}_{\text{subseq.} n_F}
$$

But how can we guarantee that the whole sequence starts with $F$?

Best Answer

You are correct. But so is the paper.

The paper explains it like this:

The probability of a success run of length at least $\ell$ immediately following a particular failure is $p^{\ell}$. Let $N_{\ell}$ be the number of success runs of length $\ell$ or longer in $n$ Bernoulli trials. If we condition on the number of failures $n_F$ in the sequence (treat it as given), we obtain \begin{eqnarray} P(L_n=\ell) & = & P(N_{\ell} > 0, N_{\ell+1}=0 \mid n_F) \\ & = & P(N_{\ell+1}=0 \mid n_F) - P(N_{\ell}=0 \mid n_F) \\ & = & (1-p^{\ell+1})^{n_F} - (1-p^{\ell})^{n_F} \approx e^{-n_F p^{\ell+1}} - e^{-n_F p^{\ell}}. \end{eqnarray}
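For large $n$ this formula agrees well with simulation. Here is a quick Monte Carlo check (the parameters and trial count are my own choices for illustration):

```python
import random

def longest_success_run(n, p, rng):
    """Longest run of successes in n Bernoulli(p) trials."""
    best = cur = 0
    for _ in range(n):
        if rng.random() < p:
            cur += 1
            best = max(best, cur)
        else:
            cur = 0
    return best

# Parameters are illustrative; n is large enough for the approximation to work well.
p, n, trials = 0.5, 1000, 20000
q = 1 - p
n_F = n * q  # plug in the expected number of failures, as the paper does

rng = random.Random(42)
counts = {}
for _ in range(trials):
    ell = longest_success_run(n, p, rng)
    counts[ell] = counts.get(ell, 0) + 1

for ell in range(6, 14):
    formula = (1 - p ** (ell + 1)) ** n_F - (1 - p ** ell) ** n_F
    simulated = counts.get(ell, 0) / trials
    print(f"l={ell:2d}  simulated={simulated:.4f}  formula={formula:.4f}")
```

The simulated frequencies and the closed-form values track each other closely, and the distribution is visibly concentrated on just a few values of $\ell$, as the paper emphasizes.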

Let's consider an example. Suppose we consider random sequences of a million digits, and we are curious about runs of the digit 3. So we consider the digit 3 as a "success", and the other digits as "failures", so $p=0.1$, $q=0.9$, and $n=10^6$. We expect that there will be roughly $10^5$ successes and $9 \cdot 10^5$ failures.

Since $n$ is large, as the paper says, "we can't go very wrong" by assuming that there are exactly $10^5$ successes and $9 \cdot 10^5$ failures, and then asking how long the longest run of 3s should be in that specific case.

For example, what is the probability that the longest run of 3s has exactly six 3s in it (i.e., $\ell=6$)? Well, after each of the $9 \cdot 10^5$ non-3s, we will have 0 or more 3s before the next non-3. The probability of having seven or more in a row is just the probability that the next seven digits are 3, which is $p^7$, so the probability of having 6 or fewer is $1-p^7$. From this, the probability that all $9\cdot10^5$ runs have length 6 or less is $(1-p^7)^{9\cdot10^5} \approx 0.914$. Similarly, the probability that all $9\cdot10^5$ runs have length 5 or less is $(1-p^6)^{9\cdot10^5} \approx 0.407$. Subtracting these gives us $0.507$, the probability that the maximum run length is exactly 6.
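The arithmetic in this example is easy to reproduce:

```python
p = 0.1          # a random digit equals 3 with probability 0.1
n_F = 9 * 10**5  # assumed number of non-3s among 10^6 digits

all_runs_le_6 = (1 - p**7) ** n_F  # no run of seven or more 3s after any non-3
all_runs_le_5 = (1 - p**6) ** n_F  # no run of six or more 3s
longest_is_6 = all_runs_le_6 - all_runs_le_5

print(f"{all_runs_le_6:.3f}  {all_runs_le_5:.3f}  {longest_is_6:.3f}")
# prints 0.914  0.407  0.507
```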

This also illustrates one of the main points of the paper, that the distribution for the maximal run length is (counterintuitively) highly concentrated. In this case the chance is over 50% that the maximal run of 3s will be of length exactly 6.

Notice that this derivation is sloppy in the following respect: we conditioned on $n_F$ so that we could easily express $P(\text{all runs are length 6 or less}) = (1-p^7)^{n_F}$ as an exponentiation with exponent $n_F$. But we conveniently forgot about this condition when deriving the base of this exponentiation: $1-p^7$ is only exact when we are not conditioning on $n_F$. If we are conditioning on $n_F$, we get a more complicated expression, because we are drawing samples without replacement. Furthermore, the exponentiation ceases to be exactly correct, because the lengths of the runs of 3s are not independent. You were right to notice that these formulas are not exactly correct. But they are approximately correct when $n$ is large. In the third line of the displayed equations, the paper should have written $\approx$ instead of $=$.

To answer your final question, the approximation doesn't care if the sequence starts with $F$. If it doesn't start with $F$, then we are ignoring the first run of $T$s. We are computing a term like $0.9999999^{900000}$, so it makes almost no difference to ignore a run (or even a few, if $n_F$ was not exactly $q \cdot n$), since this only changes the exponent by one (or a small amount), and so the result of the exponentiation is roughly unchanged. (This is a case where using concrete numbers can help us understand what the author was thinking.) The main point is that these formulas in the paper are quick approximations.
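The insensitivity to the exponent is easy to verify numerically with the same concrete numbers:

```python
base = 1 - 1e-7  # probability that one run is not "too long" (1 - p**7 above)

full = base ** 900000      # counting a run after every one of the 9*10^5 failures
one_less = base ** 899999  # ignoring one run, e.g. a leading run of successes

print(full, one_less, full - one_less)
```

The two powers differ by less than $10^{-7}$, so whether or not the sequence happens to start with $F$ has a negligible effect on the answer.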

The paper plays fast and loose with these formulas because it is oriented toward applications, not theorems. For the kinds of situations it considers, with $n$ large, the approximations made in the paper are perfectly reasonable.
