Number of distinct binary strings made by inserting m characters into a given string

combinatorics

Let $s$ be a length $n$ binary string. It is natural to ask how many distinct length $n+m$ binary strings can be made by inserting $m$ binary characters into $s$.

As an example, we can consider the distinct strings made by inserting two characters into the string $001$. Since we are inserting two characters, we are either inserting two $0$'s, one $1$ and one $0$, or two $1$'s. Inserting two $0$'s may be done in the three following distinct ways: $$00001, 00010, 00100~.$$ Inserting one $1$ and one $0$ may be done in the seven following distinct ways: $$10001, 01001, 00101, 00011, 10010, 01010, 00110~.$$ Finally, inserting two $1$'s may be done in the six following distinct ways: $$11001, 10101, 10011, 01101, 01011, 00111~.$$ Therefore we have 16 distinct strings made by inserting two characters into $001$.

Using a similar process, we can show that there are $16$ distinct strings made by inserting two characters into $000$. Similarly, there are $16$ distinct strings made by inserting two characters into $010$. In fact, for any length $3$ binary string $s$ there are $16$ distinct strings made by inserting two characters into $s$.

After discovering this, I wrote some Python code to see if this sort of thing holds more generally. As it turns out, the number of distinct strings made by inserting $m$ characters into a binary string $s$ is the same for all $s$ of length $n$. Furthermore, we get the following formula for the number of distinct strings $S$ as a function of $n$ and $m$:

$$S(n,m) = {n+m \choose 0} + {n+m \choose 1} + {n+m \choose 2} + … + {n+m \choose m}~.$$

If we assume that $S(n,m)$ is well-defined for all $n$ and $m$ (that is, if different base strings of length $n$ give the same answer), then proving this formula is relatively straightforward. In the case of the base string $00 \ldots 0$ of length $n$, the strings of length $n+m$ formed by inserting $m$ characters are precisely the binary strings of length $n+m$ containing at least $n$ zeros. These can be counted by ignoring the strings of length $n+m$ with fewer than $n$ zeros.

$$\begin{align}S(n,m) &= 2^{n+m} – {n+m \choose 0} – {n+m \choose 1} – … – {n+m \choose n-1} \\ &= {n+m \choose n} + {n+m \choose n+1} + … + {n+m \choose n+m} \\ &= {n+m \choose 0} + {n+m \choose 1} + … + {n+m \choose m}.\end{align}$$

This proof of the formula only works, however, if we are willing to grant that the number of distinct strings depends only on $m$ and on the length $n$ of the base string. I was unable to prove this fact. Can someone provide a deeper explanation of what's going on here?

Best Answer

Think of it in a different way. Given binary strings $T$ and $S$ of lengths $n+m$ and $n$ respectively, $T$ can be obtained from $S$ by inserting $m$ characters if and only if $S$ is a subsequence of $T$. Now, when is that true?

If it is, the subsequence can be obtained as follows. Starting at the left, let $i_1$ be the first position $i$ where $T_i = S_1$, let $i_2$ be the first $i > i_1$ where $T_i = S_2$, ..., $i_n$ the first $i > i_{n-1}$ where $T_i = S_n$. If you can define $i_1, \ldots, i_n$ in this way, i.e. you never get to a point where there are no $i > i_{k-1}$ such that $T_i = S_k$, then $S$ is the subsequence $T_{i_1},\ldots, T_{i_n}$.

Now, given two binary strings $S$ and $S'$ of length $n$, I will define a map $f$ from $\{0,1\}^{n+m}$ to itself such that $S$ is a subsequence of $T$ if and only if $S'$ is a subsequence of $f(T)$. Consider the construction above for $T$ and $S$, obtaining $i_1, \ldots, i_k$ (where either $k=n$, in which case $S$ is a subsequence of $T$, or $k < n$ and $i_{k+1}$ does not exist). For convenience, take $i_0 = 0$, and $i_{k+1} = n+m+1$. If $i_{j-1} < i \le i_{j}$, let $f(T)_i = T_i$ if $S_j = S'_j$ (or $j = n+1$), $1-T_i$ if $S_j \ne S'_j$. It is easy to see that $f$ is a one-to-one correspondence and has the desired property. This shows that the number of $T$ with $S$ as a subsequence is equal to the number with $S'$ as a subsequence.