Average number of strings with edit distance exactly 2

combinatoricsprobability

Consider a binary string of length $n \geq 2$. An edit operation is a single character insert, delete or substitution. The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other one. Given a string $S$, my question relates to the number of distinct strings of length $n$ which are edit distance $2$ from $S$.

Let us write $f_2(S)$ for the number of distinct strings of length $n$ which are edit distance $2$ from $S$.

Let $X_n$ be a random variable representing a random binary string of length $n$, with the bits chosen uniformly and independently. My question is what is:

$$\mathbb{E}(f_2(X_n))\;?$$

For small $n$ we can compute the value exactly:

  • $\mathbb{E}(f_2(X_2)) = 1$.
  • $\mathbb{E}(f_2(X_3)) = 3 \frac{1}{4}$.
  • $\mathbb{E}(f_2(X_4)) = 7 \frac{1}{8}$.
  • $\mathbb{E}(f_2(X_5)) = 12 \frac{13}{16}$.
  • $\mathbb{E}(f_2(X_6)) = 20 \frac{13}{32}$.
  • $\mathbb{E}(f_2(X_7)) = 29 \frac{61}{64}$.
  • $\mathbb{E}(f_2(X_8)) = 41 \frac{61}{128}$.
  • $\mathbb{E}(f_2(X_9)) = 54 \frac{253}{256}$.
  • $\mathbb{E}(f_2(X_{10})) = 70 \frac{253}{512}$.

See What is the expected number of distinct strings from a single edit operation? for a related question about edit distance 1 which has a very clean and simple solution.

Best Answer

Since you want the length to remain unchanged and $2$ to be the minimal edit distance, the only options are two substitutions in different places, or an insertion and a deletion. (It doesn't matter in which order we carry out the insertion and the deletion.) It's straightforward that there are $\binom n2=\frac{n(n-1)}2$ different results of two substitutions in different places, so the task is to count the strings produced by an insertion and a deletion that can't be produced by at most two substitutions.

Let's count the cases where the insertion is to the left of the deletion and then multiply by $2$. The combined effect of the insertion and the deletion is to shift all $k$ bits between them to the right while replacing the first one and removing the last one. This result can also be achieved by at most $k$ substitutions, so we need $k\gt2$. Inserting $x$ within a run of $x$s has the same effect as inserting $x$ at the end of the run. Thus we can count all insertions with different effects once by always inserting the bit complementary to the one to the right of the insertion. Similarly, a deletion within a run has the same effect as a deletion at the start of the run, so we should only count deletions that follow a change between $0$ and $1$.

That gives us an initial count of

$$ 2\cdot\frac12\sum_{k=3}^n(n+1-k)=\sum_{k=1}^{n-2}k=\frac{(n-1)(n-2)}2\;, $$

which together with $\frac{n(n-1)}2$ from the substitutions yields $(n-1)^2$. That's already of the order of the counts you computed, but a bit too high, so we're overcounting.

If there are no further changes in the $k$ shifted bits other than the one preceding the deletion, then only the bits next to the insertion and deletion change, and we can achieve that with $2$ substitutions, so we have to subtract

$$ \sum_{k=3}^n\left(\frac12\right)^{k-2}(n+1-k)=\sum_{k=1}^{n-2}\left(\frac12\right)^{n-k-1}k=n-3+2^{-(n-2)}\;. $$

Also, if the entire range of shifted bits consists of alternating zeros and ones, then swapping the insertion and the deletion yields the same effect, so in this case we were double-counting and need to subtract

$$ \sum_{k=3}^n\left(\frac12\right)^{k-1}(n+1-k)\;, $$

which is half the previous sum. Thus, the expected number of binary strings of length $n$ at edit distance exactly $2$ from a uniformly randomly selected binary string of length $n$ is

$$ (n-1)^2-\frac32\left(n-3+2^{-(n-2)}\right)=n^2-\frac72n+\frac{11}2-6\cdot2^{-n}\;, $$

in agreement with your computed results.

Related Question