[Math] Counting the number of bit strings containing a substring

bit-stringscombinatorics

Consider a general bit string of length $n$; how many bit strings are there that contain a substring $T$?

For example, given a bit string of length 6, how many are there that contain $110$ as a substring, and how many that contain $101$?

Best Answer

Given the comments, I think you should check the article here : https://oeis.org/A005251 in OEIS.

The mentioned recurrence formula is : a(n) = a(n-1) + a(n-2) + a(n-4) and it may be obtained by taking into account four kind of sequences:

$A_n$ : good sequences of length n that finish in 00

$B_n$ : good sequences of length n that finish in 01

$C_n$ : good sequences of length n that finish in 10

$D_n$ : good sequences of length n that finish in 11

and see what happens if we add one digit.

their sum $E_{n+1}$ would be expressed :

$$E_{n+1} = A_{n+1} + B_{n+1}+C_{n+1}+D_{n+1} = (A_n+B_n+C_n+D_n) + A_{n-1} + B_{n-1} + D_{n-1} = ....... =E_n + E_{n-1} + E_{n-3}$$

where the first four E's are 4, 7, 12, 21 for n = 2, 3, 4, 5 and they have to be obtained by hand.

Related Solutions

[Math] How many binary strings (with a given number of occurrences of 0 and 1) are there that do not contain a given substring

This answer is based upon the Goulden-Jackson Cluster Method.

We consider the set words of length $n\geq 0$ built from an alphabet $$\mathcal{V}=\{0,1\}$$ and the set $B=\{010\}$ of bad words, which are not allowed to be part of the words we are looking for. We derive a generating function $f(s)$ with the coefficient of $s^n$ being the number of searched words of length $n$.

According to the paper (p.7) the generating function $f(s)$ is \begin{align*} f(s)=\frac{1}{1-ds-\text{weight}(\mathcal{C})}\tag{1} \end{align*} with $d=|\mathcal{V}|=2$, the size of the alphabet and $\mathcal{C}$ the weight-numerator of bad words with \begin{align*} \text{weight}(\mathcal{C})=\text{weight}(\mathcal{C}[010]) \end{align*}

We calculate according to the paper \begin{align*} \text{weight}(\mathcal{C}[010])&=-s^3-s^2\text{weight}(\mathcal{C}[010])\tag{1}\\ \end{align*}

and get \begin{align*} \text{weight}(\mathcal{C})=-\frac{s^3}{1+s^2} \end{align*}

It follows

\begin{align*} f(s)&=\frac{1}{1-ds-\text{weight}(\mathcal{C})}\\ &=\frac{1}{1-2s+\frac{s^3}{1+s^2}}\tag{2}\\ &=\frac{1+s^2}{1-2s+s^2-s^3}\\ &=1+2s+4s^2+7s^3+12s^4+\color{blue}{21}s^5\\ &\qquad+37s^6+65s^7+114s^8+200s^9+351s^{10}+\cdots \end{align*}

The last line was calculated with the help of Wolfram Alpha. The coefficient of $s^5$ shows there are $\color{blue}{21}$ valid words of length $5$ from the alphabet $\{0,1\}$ which do not contain the word $010$.

But we want to also keep track of the number of $0$'s and $1$'s. We get a refinement of $f(s)$ by marking the $0$ with $x$ and the $1$'s with $y$. We obtain from (1) \begin{align*} \text{weight}(\mathcal{C}[010])&=-x^2ys^3-xys^2\text{weight}(\mathcal{C}[010]) \end{align*} and get \begin{align*} \text{weight}(\mathcal{C})=-\frac{x^2ys^3}{1+xys^2} \end{align*}

Using this generalized weight we obtain from (2) a generating function $g(s;x,y)$ \begin{align*} g(s;x,y)&=\frac{1}{1-(x+y)s+\frac{x^2ys^3}{1+xys^2}}\\ &=\frac{1+xys^2}{1-(x+y)s+xys^2-xy^2s^3}\\ &=1+(x+y)s+(x^2+2xy+y^2)s^2+(x^3+2x^2y+3xy^2+y^3)\\ &\qquad+(x^4+2x^3y+4x^2y^2+4xy^3+y^4)s^4\\ &\qquad+(x^5+2x^4y+5x^3y^2+\color{red}{7}x^2y^3+5xy^4+y^5)s^5+\cdots \end{align*}

So, e.g. out of $2^5=32$ binary words of length $5$ there are $\color{blue}{21}$ valid words which do not contain $010$ and $\color{red}{7}$ of them contain $n=2$ zeros and $m=3$ ones:

\begin{array}{cccc} \qquad\color{blue}{00000}\qquad&\qquad01000\qquad&\qquad\color{blue}{10000}\qquad&\qquad\color{blue}{11000}\qquad\\ \color{blue}{00001}&01001&\color{blue}{10001}&\color{red}{11001}\\ 00010&01010&10010&11010\\ \color{blue}{00011}&01011&\color{red}{10011}&\color{blue}{11011}\\ 00100&\color{blue}{01100}&10100&\color{red}{11100}\\ 00101&\color{red}{01101}&10101&\color{blue}{11101}\\ \color{blue}{00110}&\color{red}{01110}&\color{red}{10110}&\color{blue}{11110}\\ \color{red}{00111}&\color{blue}{01111}&\color{blue}{10111}&\color{blue}{11111}\\ \end{array}

Combinatorics – Counting Binary Strings Without Block Decompositions

We denote with

run of length $n$ a substring of $n$ consecutive, equal characters from a binary alphabet $\{0,1\}$.
A maximum run of a string is a run with maximum length.
An $n$-length zero-run is a run consisting of $n$ consecutive $0$s, and a one-run is specified accordingly.

We derive a generating function for the number $a_n$ of strings of length $n$ having no odd, maximum zero-run.

To avoid ambiguities we follow OPs comment and allow odd maximum runs of $1$'s together with odd runs of $0$'s as long as they are less in length than the odd maximum run of $1$'s which seems to be a somewhat more challenging problem.

The string $$0111\color{blue}{000}11$$ has an odd maximum zero-run of length $3$ equal to the maximum one-run and is not admissible. The string $$\color{blue}{0}111011$$ has an odd maximum zero-run of length $1$ less than the maximum one-run of length $3$ and is admissible.

We derive the number of admissible words of length $n$ by counting those words which are not admissible. The number of wanted words is $2^n$ minus this quantity. We do so by deriving a generating function \begin{align*} G^{(=2k+1,\geq 2k+1)}(z)\qquad\qquad\qquad k\geq 0 \end{align*} The first component of the index gives the maximum run length of $0$'s, the second component gives the maximum run length of $1$'s. So, $[z^n]G^{(=2k+1,\geq 2k+1)}(z)$ gives the number of words of length $n$ having maximum zero-runs of length $2k+1$ and maximum one-runs of length less or equal to $2k+1$.

It follows the number $a_n$ of admissible words of length $n$ is \begin{align*} a_n=2^n-[z^n]\sum_{k=0}^{\lfloor (n-1)/2\rfloor}G^{(=2k+1,\geq 2k+1)}(z)\qquad\qquad n\geq 0 \end{align*}

We build the generating functions $G(z)$ from the generating function of Smirnov words also called Carlitz words. These are words with no consecutive equal characters at all. (See example III.24 Smirnov words from Analytic Combinatorics by Philippe Flajolet and Robert Sedgewick for more information.)

A generating function for the number of Smirnov words over a binary alphabet is given by \begin{align*} \left(1-\frac{2z}{1+z}\right)^{-1} \end{align*}

Replacing occurrences of $0$ in a Smirnov word by one up to $2k+1$ zeros generates words having runs of $0$ with length less or equal $2k+1$. \begin{align*} z\longrightarrow z+z^2+\cdots+z^{2k+1}=\frac{z\left(1-z^{2k+1}\right)}{1-z} \end{align*}

The same can be done with ones and the resulting generating function is \begin{align*} G^{(\geq 2k+1,\geq 2k+1)}(z)=\left(1-2\cdot\frac{\frac{z\left(1-z^{2k+1}\right)}{1-z}}{1+\frac{z\left(1-z^{2k+1}\right)}{1-z}}\right)^{-1} &=\frac{1-z^{2k+1}}{1-2z+z^{2k+1}} \end{align*}

Since we need a generating function counting words having exactly $2k+1$ zero-runs we calculate \begin{align*} &G^{(=2k+1,\geq 2k+1)}(z)=G^{(\geq 2k+1,\geq 2k+1)}(z)-G^{(\geq 2k,\geq 2k+1)}(z)\\ &\qquad=\left(1-\frac{\frac{2z\left(1-z^{2k+1}\right)}{1-z}}{1+\frac{z\left(1-z^{2k+1}\right)}{1-z}}\right)^{-1} -\left(1-\frac{\frac{z\left(1-z^{2k}\right)}{1-z}}{1+\frac{z\left(1-z^{2k}\right)}{1-z}} -\frac{\frac{z\left(1-z^{2k+1}\right)}{1-z}}{1+\frac{z\left(1-z^{2k+1}\right)}{1-z}}\right)^{-1}\\ &\qquad=\frac{1-z^{2k+2}}{1-2z+z^{2k+2}}-\frac{(1-z^{2k+1})(1-z^{2k+2})}{1-2z+z^{2k+2}+z^{2k+3}-z^{4k+3}}\\ &\qquad=\frac{(1-z^{2})(1-z^{2k+2})z^{2k+1}}{(1-2z+z^{2k+2})(1-2z+z^{2k+2}+z^{2k+3}-z^{4k+3})} \end{align*}

We finally conclude

The number $a_n$ of admissible words of length $n$ is \begin{align*} a_n&=2^n-[z^n]\sum_{k=0}^{\lfloor (n-1)/2\rfloor}G^{(=2k+1,\geq 2k+1)}(z)\\ &=2^n-[z^n]\sum_{k=0}^{\lfloor (n-1)/2\rfloor} \frac{(1-z^{2})(1-z^{2k+2})z^{2k+1}}{(1-2z+z^{2k+2})(1-2z+z^{2k+2}+z^{2k+3}-z^{4k+3})}\tag{1} \end{align*} The sequence $(a_n)$ starts with \begin{align*} (a_n)_{n\geq 1}=(1,2,5,12,24,\color{blue}{48},95,187,367,724,\ldots) \end{align*}

Example: Let's calculate the number $a_6$ of admissible words of length $6$.

According to (1) we obtain the following series expansions (with some help of Wolfram Alpha) \begin{align*} G^{(=1,\leq 1)}(z)&=\frac{(1-z)^2(1-z^2)z}{(1-2z+z^2)(1-2z+z^2)}=\frac{(1+z)z}{1-z}\\ &=z+2z^2+2z^3+2z^4+2z^5+\color{blue}{2}z^6+2z^7+2z^8+2z^9+2z^{10}\cdots\\ G^{(=3,\leq 3)}(z)&=z^3+2z^4+5z^5+\color{blue}{12}z^6+25z^7+53z^8+109z^9+220z^{10}\cdots\\ G^{(=5,\leq 5)}(z)&=z^5+\color{blue}{2}z^6+5z^7+12z^8+28z^9+64z^{10}\cdots\\ \end{align*}

We conclude the number $a_6$ is given by \begin{align*} a_6&=2^6-[z^6]\sum_{k=0}^2G^{(=k,\leq k)}(z)\\ &=64-[z^6]\left(G^{(=1,\leq 1)}(z)+G^{(=3,\leq 3)}(z)+G^{(=5,\leq 5)}(z)\right)\\ &=64-\color{blue}{2}-\color{blue}{12}-\color{blue}{2}\\ &=48 \end{align*}

The $64-a_6=16$ non-admissible words are

\begin{array}{cccc} 1\color{blue}{0}1010&\color{blue}{0}10101\\ 001\color{blue}{000}&101\color{blue}{000}&011\color{blue}{000}&111\color{blue}{000}\\ \color{blue}{000}100&1\color{blue}{000}10&\color{blue}{000}110&01\color{blue}{000}1\\ 11\color{blue}{000}1&\color{blue}{000}101&1\color{blue}{000}11&\color{blue}{000}111\\ 1\color{blue}{00000}&\color{blue}{00000}1\\ \end{array}

Best Answer

Related Solutions

[Math] How many binary strings (with a given number of occurrences of 0 and 1) are there that do not contain a given substring

Combinatorics – Counting Binary Strings Without Block Decompositions

Related Question