Average number of strings with edit distance exactly 2

combinatoricsprobability

Consider a binary string of length $n \geq 2$. An edit operation is a single character insert, delete or substitution. The edit distance between two strings is the minimum number of edit operations needed to transform one string into the other one. Given a string $S$, my question relates to the number of distinct strings of length $n$ which are edit distance $2$ from $S$.

Let us write $f_2(S)$ for the number of distinct strings of length $n$ which are edit distance $2$ from $S$.

Let $X_n$ be a random variable representing a random binary string of length $n$, with the bits chosen uniformly and independently. My question is what is:

$$\mathbb{E}(f_2(X_n))\;?$$

For small $n$ we can compute the value exactly:

$\mathbb{E}(f_2(X_2)) = 1$.
$\mathbb{E}(f_2(X_3)) = 3 \frac{1}{4}$.
$\mathbb{E}(f_2(X_4)) = 7 \frac{1}{8}$.
$\mathbb{E}(f_2(X_5)) = 12 \frac{13}{16}$.
$\mathbb{E}(f_2(X_6)) = 20 \frac{13}{32}$.
$\mathbb{E}(f_2(X_7)) = 29 \frac{61}{64}$.
$\mathbb{E}(f_2(X_8)) = 41 \frac{61}{128}$.
$\mathbb{E}(f_2(X_9)) = 54 \frac{253}{256}$.
$\mathbb{E}(f_2(X_{10})) = 70 \frac{253}{512}$.

See What is the expected number of distinct strings from a single edit operation? for a related question about edit distance 1 which has a very clean and simple solution.

Best Answer

Since you want the length to remain unchanged and $2$ to be the minimal edit distance, the only options are two substitutions in different places, or an insertion and a deletion. (It doesn't matter in which order we carry out the insertion and the deletion.) It's straightforward that there are $\binom n2=\frac{n(n-1)}2$ different results of two substitutions in different places, so the task is to count the strings produced by an insertion and a deletion that can't be produced by at most two substitutions.

Let's count the cases where the insertion is to the left of the deletion and then multiply by $2$. The combined effect of the insertion and the deletion is to shift all $k$ bits between them to the right while replacing the first one and removing the last one. This result can also be achieved by at most $k$ substitutions, so we need $k\gt2$. Inserting $x$ within a run of $x$s has the same effect as inserting $x$ at the end of the run. Thus we can count all insertions with different effects once by always inserting the bit complementary to the one to the right of the insertion. Similarly, a deletion within a run has the same effect as a deletion at the start of the run, so we should only count deletions that follow a change between $0$ and $1$.

That gives us an initial count of

$$ 2\cdot\frac12\sum_{k=3}^n(n+1-k)=\sum_{k=1}^{n-2}k=\frac{(n-1)(n-2)}2\;, $$

which together with $\frac{n(n-1)}2$ from the substitutions yields $(n-1)^2$. That's already of the order of the counts you computed, but a bit too high, so we're overcounting.

If there are no further changes in the $k$ shifted bits other than the one preceding the deletion, then only the bits next to the insertion and deletion change, and we can achieve that with $2$ substitutions, so we have to subtract

$$ \sum_{k=3}^n\left(\frac12\right)^{k-2}(n+1-k)=\sum_{k=1}^{n-2}\left(\frac12\right)^{n-k-1}k=n-3+2^{-(n-2)}\;. $$

Also, if the entire range of shifted bits consists of alternating zeros and ones, then swapping the insertion and the deletion yields the same effect, so in this case we were double-counting and need to subtract

$$ \sum_{k=3}^n\left(\frac12\right)^{k-1}(n+1-k)\;, $$

which is half the previous sum. Thus, the expected number of binary strings of length $n$ at edit distance exactly $2$ from a uniformly randomly selected binary string of length $n$ is

$$ (n-1)^2-\frac32\left(n-3+2^{-(n-2)}\right)=n^2-\frac72n+\frac{11}2-6\cdot2^{-n}\;, $$

in agreement with your computed results.

Related Solutions

Generating function for distribution of 12 balls into 10 cells

There are $10^{12}$ ways to assign the balls to the cells, each of which we assume is equally likely. (It's easy to get off on the wrong foot in this problem by overlooking this assumption, for example by assuming that all non-negative integer solutions to $x_1+x_2+ \dots +x_{10} = 12$ are equally likely, which is a very unlikely assumption.) We want to count all the assignments in which exactly one cell is empty.

To do this, we will apply a variation of the Principle of Inclusion/Exclusion (PIE). Let's say an assignment of balls to cells has "property $i$" if cell $i$ is empty, for $i = 1,2,3,\dots ,10$, so our goal is to find the number of arrangements with exactly one of the properties. Further, we define $S_j$ as the total number of assignments which have $j$ of the properties (with over-counting), for $j = 1, 2, 3, \dots ,9$. Then we have $$S_j = \binom{10}{j} (10-j)^{12}$$ for $1 \le j \le 9$, since there are $\binom{10}{j}$ ways to pick the empty cells and $(10-j)^{12}$ ways to assign the balls to the remaining cells.

The variation of PIE we want to apply is that if there are $n$ properties, then the number of arrangements with exactly $m$ of the properties is $$N_m = S_m - \binom{m+1}{m} S_{m+1} + \binom{m+2}{m} S_{m+2} - \dots + (-1)^{n-m} \binom{n}{m}S_n$$ (Reference: Applied Combinatorics, Second Edition by Allan Tuckser, Section 8.2, Theorem 2; or An Introduction to Probability Theory and Its Applications, Volume I, Third Edition, by William Feller, Section IV.3.) The case we are interested in is $m=1$, $n=9$, so $$N_1 = S_1 - \binom{2}{1} S_2 + \binom{3}{1} S_3 - \dots + \binom{9}{1}S_9$$ which yields $N_1 \approx 8.08315 \times 10^{10}$. Therefore the probability of having exactly one cell empty is $$\frac{N_1}{10^{12}} \approx \boxed{0.0808315}$$

Average number of strings with edit distance at most 4

Instead of considering one string and ask about strings close to it, it's easier to take a more symmetric approach and ask for the number of pairs of strings of given length $n$ that have levenshtein distance at most 4. Then the average number of strings of distance at most 4 is that number divided by $2^n$. The counting will be done using finite automata. To motivate the automaton, let's consider a nondeterministic algorithm to compute the levenshtein distance.

I imagine a machine with two input tapes, initially at the first letter, and a counter, initially set to 0. The machine compares the letters at the position of the heads of the tapes, and does one of the following steps: move both reading heads forward and, if the read letters were different, increment the counter, or just move one head forward and increment the counter. (A head cannot move forward when it has already passed the end of the word.) When both heads are behind the words, the computation ends and the value of the counter is the result. I claim that this result is an upper bound to the levenshtein distance of the input words, and that there is a computation that returns exactly the distance.

I'll formalize this description with pseudocode where I replace the tapes with variables containing the remaining, unprocessed string. Note that I only look at its first element (head), and only replace it with a version where this element is removed (tail):

algorithm levdist(l,r):
  # nondeterministically compute an upper bound for the levenshtein distance
  # in a why where the real distance is among the possible results
  c := 0;
  while l and r are not both empty
    do one of these operations:
      - a:=head(l); b:=head(r); l:=tail(l); r:=tail(r); if a<>b then c:=c+1;
      - l:=tail(l); c:=c+1;
      - r:=tail(r); c:=c+1;
  return c;

Let's now simulate this computation with a nondeterministic (infinite) automaton over the alphabet of pairs of bits, such that the input $(l_1,r_1),(l_2,r_2)\dots(l_n,r_n)$ corresponds to input $(l_1l_2\dots l_n,r_1r_2\dots r_n)$ to the original machine. Note that now the words are of the same length, a restriction we didn't need before.

The first machine can compare letters $a_i$ and $b_j$ at different positions. If e.g. $i<j$, the automaton has to read up to $(a_j,b_j)$ to know $b_j$, and then still must remember $a_i$ (and the following bits) in its state. So to be able to simulate the machine, the automaton will have states that know the counter, which (if any) tape is behind, and what the unprocessed bits on that tape were.

So we have states $N_c$ for $c\geq0$ for when the tapes are in sync, and states $L_{c,w}$ and $R_{c,w}$ with nonempty words $w$ and $c\geq |w|$, the length of $w$. To simplify the following, let's allow $L_{c,\epsilon}$ and $R_{c,\epsilon}$ as alternate names for $N_c$, where $\epsilon$ is the empty word.

For input $(l,r)$, the states $N_c$ have transitions to itself if $l=r$, otherwise to $N_{c+1}$, and to $L_{c+1,l}$ and $R_{c+1,r}$. If $\omega$ is a bit and $w=\omega w'$, then for that same input $L_{c,w}$ has transitions to $L_{c',w'l}$ with $c'=c$ if $\omega=r$, otherwise $c'=c+1$, and to $L_{c+1,wl}$. It also has an $\epsilon$-transition to $L_{c+1,w'}$. The transitions of the $R$ state are likewise, with the roles of $l$ and $r$ swapped.

When the automaton starts with initial state $N_0$ and processes some input, it will end in some state from which an $N_c$ state can be reached. This $c$ is an upper bound for the levenshtein distance of the input. Each computation of the first machine corresponds to a computation of the automaton that ends in the $N_c$ state with the $c$ that the machine computes.

Now we don't really want to compute the distance, we only want to know if it is at most 4, which we'll generalize to $k$. (And for real computations, we prefer finite automata.) To achieve this, we can restrict the states to those that have $c\leq k$. Now some of the $L_{c,w}$ and $R_{c,w}$ states can no longer reach an $N$ state. When we remove them, we keep those with $1\leq c\leq |w|\leq k-|w|$. When this automaton processes some input with initial state $N_0$ and doesn't get stuck, we know that it could reach a state $N_c$ (with $c\leq k$), so we can just declare all states as accepting states.

The following GAP code which uses the automata package (of which we will use more below) formalizes the construction of this nondeterministic finite automaton:

LoadPackage("automata");

levnfa := function(k)
  local kh, d, c, states, l, w, AddT, dir, hd, tl, c0, c1,
        t00, t01, t10, t11, te, e00, e01, x01, e10, x10, e11, ee;

  kh:=QuoInt(k,2);
  d:=NewDictionary([],true);

  # prepare states
  for c in [0..k] do
    AddDictionary(d,[0,c],c+1);
    AddDictionary(d,[1,c],c+1);
  od;
  states:=k+1;
  for l in [1..kh] do
    for w in IteratorOfTuples([0,1],l) do
      for c in [l..k-l] do
        states:=states+1;
        AddDictionary(d,Concatenation([0,c],w),states);
        states:=states+1;
        AddDictionary(d,Concatenation([1,c],w),states);
      od;
    od;
  od;


  AddT:=function(sl,arg)
    # add state id of state described by arg to sl if it exists,
    # arg is [dir, c, w, x] where x is an optional extra bit appended to w
    local x,l,val;
    if Length(arg)=4 then
      x:=[arg[4]];
    else
      x:=[];
    fi;
    l:=Concatenation(arg{[1,2]},arg[3],x);
    val := LookupDictionary(d,l);
    if val<>fail then
      Add(sl,val);
    fi;
  end;


  t00:=[]; t01:=[]; t10:=[]; t11:=[]; te:=[];
  for c in [0..k] do
    e00:=[]; e01:=[]; e10:=[]; e11:=[];
    AddT(e00,[0,c,[]]); AddT(e11,[0,c,[]]);
    AddT(e01,[0,c+1,[]]); AddT(e10,[0,c+1,[]]);
    AddT(e00,[0,c+1,[0]]); AddT(e00,[1,c+1,[0]]);
    AddT(e01,[0,c+1,[0]]); AddT(e01,[1,c+1,[1]]);
    AddT(e10,[0,c+1,[1]]); AddT(e10,[1,c+1,[0]]);
    AddT(e11,[0,c+1,[1]]); AddT(e11,[1,c+1,[1]]);
    Add(t00,e00); Add(t01,e01); Add(t10,e10); Add(t11,e11); Add(te,[]);
  od;
  for l in [1..kh] do
    for w in IteratorOfTuples([0,1],l) do
      for c in [l..k-l] do
        for dir in [0,1] do
          e00:=[]; x01:=[]; x10:=[]; e11:=[]; ee:=[];
          hd:=w[1];
          tl:=w{[2..Length(w)]};
          if hd=0 then
            c0:=0; c1:=1;
          else
            c0:=1; c1:=0;
          fi;
          AddT(e00,[dir,c+c0,tl,0]);
          AddT(x01,[dir,c+c1,tl,0]);
          AddT(x10,[dir,c+c0,tl,1]);
          AddT(e11,[dir,c+c1,tl,1]);
          AddT(e00,[dir,c+1,w,0]);
          AddT(x01,[dir,c+1,w,0]);
          AddT(x10,[dir,c+1,w,1]);
          AddT(e11,[dir,c+1,w,1]);
          AddT(ee, [dir,c+1,tl]);
          if dir=0 then
            e01:=x01; e10:=x10;
          else
            e01:=x10; e10:=x01;
          fi;
          Add(t00,e00); Add(t01,e01); Add(t10,e10); Add(t11,e11); Add(te,ee);
        od;
      od;
    od;
  od;

  return Automaton( "epsilon", states, 5,
                    [t00, t01, t10, t11, te],
                    [1], [1..states] );
end;

The automaton that we get can be transformed into a minimal deterministic one. From its transition function we can get a matrix $M$ that contains the number of ways we can get from one state to another in one step. From the matrix $M^n$ we can see how many ways there are to get from one state to another in $n$ steps. The number of ways to get from the initial state to a terminal one in $n$ steps equals the number of accepted words, i.e. the number of pairs of words of length $n$ with levenshtein distance $n$, because the automaton is deterministic.

The automaton has a sink state, that we may remove to very slightly simplify the calculation, since it does not contribute to the accepted words. Then all the remaining states are accepting.

Here is GAP code for these calculations:

nfa := levnfa(4); # for k=4

dfa := RemovedSinkStates(MinimalizedAut(nfa));

size := NumberStatesOfAutomaton(dfa);

mat := NullMat(size, size);;
for row in TransitionMatrixOfAutomaton(dfa) do
  for i in [1..size] do
    if row[i]<>0 then
      mat[i][row[i]] := mat[i][row[i]]+1;
    fi;
  od;
od;

init := ListWithIdenticalEntries(size, 0);;
init[InitialStatesOfAutomaton(dfa)[1]] := 1;;

Assert(0, FinalStatesOfAutomaton(dfa)=[1..size]);
fin := ListWithIdenticalEntries(size, 1);;

Now we can compute the number of pairs of words of length $n$ with distance at most $k$ with init * mat^n * fin, but if we want that for a lot of $n$, and especially if $k$ and therefore the size of the matrix gets bigger, it is more efficient to do something like this:

res:=[];; v:=init;;
for i in [1..150] do
  v:=v*mat;
  Add(res, v*fin);
od;

To find a closed formula, we need the eigenvalues of the matrix $M$. From the GAP computation

gap> Set(Factors(CharacteristicPolynomial(Rationals,mat)));
[ x_1-2, x_1-1, x_1, x_1+1, x_1^2+1, x_1^2+x_1+1 ]

(GeneralizedEigenvalues(Rationals,mat); computes the same but takes longer) we see that the eigenvalues are $2, 1, 0, -1, \pm i$ and the third roots of unity.

We can use the index of the nilpotent part of the Jordan-Chevalley decomposition minus one as bound for the degree of the polynomials that appear in the closed formula, and as bound for the values for which it works, but for bigger $k$ that seems to be too much work and we can just guess some number. Also, if we do the work to compute the decomposition, it should be possible to directly get the formula from it. Anyway, GAP allows us to compute:

gap> j:=JordanDecomposition(mat);;
gap> IsZero(j[2]^4);
false
gap> IsZero(j[2]^5);
true

Now let's find out how to express the function as linear combination of exponential functions (for nonzero real eigenvalues) and sine and cosine functions (for complex eigenvalues), multiplied with powers of $n$. Let's use powers up to the fifth, even though we expect only fourth powers. The sine function corresponding to the third root of unity is multiplied with $\sqrt{3}$ to get rational results. We need to compute 42 coefficients, to be sure we do the fitting with all the values of $n$ between 10 and 100:

gap> e := x -> n -> x^n;;
gap> ci := n -> RealPart(E(4)^n);;
gap> si := n -> ImaginaryPart(E(4)^n);;
gap> c3 := n -> RealPart(E(3)^n);;
gap> s3 := n -> ImaginaryPart(E(3)^n) * Sqrt(3);;
gap> SolutionMat(TransposedMat(List([10..100],n->Concatenation(
>       List([e(2),e(1),e(-1),ci,si,c3,s3], fn ->
>        List([0..5], exp -> n^exp*fn(n)))))),
>     res{[10..100]});
[ 168264301/345744, -4115011/16464, 19597/336, -161/24, 17/48, 0, 
  -309817/648, -7058/81, -856/81, 449/81, -515/324, 0, 1/72, 0, 0, 0, 0, 0, 
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -50920/194481, 2378/27783, 22/3969, 0, 
  0, 0, -61456/583443, 2566/83349, -10/3969, 0, 0, 0 ]

We succeeded, and we see that the eigenvalues $\pm i$ do not contribute to the result. After division by $2^n$, we get the result

$$\mathbb{E}(g_4(X_n))=p_2(n)+2^{-n}\Biggl(p_1(n)+(-1)^n p_{-1}(n)+ \cos\left(\frac{2\pi n}{3}\right)p_c(n)+\sqrt{3}\sin\left(\frac{2\pi n}{3}\right)p_s(n)\Biggr) $$

where

\begin{align} p_2(n) &= \frac{17}{48}n^4-\frac{161}{24}n^3+\frac{19597}{336}n^2- \frac{4115011}{16464}n+\frac{168264301}{345744}, \\ p_1(n) &= -\frac{515}{324}n^4+\frac{449}{81}n^3-\frac{856}{81}n^2- \frac{7058}{81}n-\frac{309817}{648}, \\ p_{-1}(n) &= \frac{1}{72}, \\ p_c(n) &= \frac{22}{3969}n^2+\frac{2378}{27783}n-\frac{50920}{194481}, \\ p_s(n) &= -\frac{10}{3969}n^2+\frac{2566}{83349}n-\frac{61456}{583443}. \end{align}

By comparison we see that the formula works for $n\geq 5$.

Bigger $k$

When we do the same for $k=5$, we get no new eigenvalues, but now we get a contribution from the $\pm i$ eigenvalues. Amazingly, the polynomial corresponding to the eigenvalue 1 has degree 6. And the formula only works for $n\geq 8$, so it misses some of the nontrivial cases.

For even bigger $k$ we get more roots of unity as eigenvalues. The computations get more difficult: the size of the NFA grows faster than $2^{k/2}$, and the DFA is much bigger. The matrix $M$ has dimension $N\times N$ where $N$ is the size of the DFA. But it is a sparse matrix with less than $4N$ nonzero entries.

The following GAP function computes a sparse representation of the matrix directly from the transition matrix of the automaton and writes it to a file named sparse:

save_sparse := function(tm)
  local os, size, row, i, st, states;
  size:=Length(tm[1]);
  os:=OutputTextFile("sparse",false);
  AppendTo(os, size, " ", size, " ", "M\n");
  for i in [1..size] do
    states := Filtered(List([1..4], n -> tm[n][i]), s -> s<>0);
    for st in Set(states) do
      AppendTo(os, i, " ", st, " ", Number(states, s-> s=st), "\n");
    od;
  od;
  AppendTo(os, "0 0 0\n");
  CloseStream(os);
end;

It can be used like this:

save_sparse(TransitionMatrixOfAutomaton(dfa));

I could find the eigenvalues for bigger $k$ by computing the minimal polynomial of the matrix using LinBox.

The following table gives the sizes of the automata and the new eigenvalues for $k\leq 10$:

 k   #NFA    #DFA    new eigenvalues
 0      1       1          2
 1      2       2          -
 2      7      15        1,0,-1
 3     12      38          -
 4     25     265      E(3),E(4)
 5     38     700          -
 6     67    4389      E(5),E(6)
 7     96   11856          -
 8    157   64905      E(7),E(8)
 9    218  175766          -
10    343  859265      E(9),E(10)

where E(n) denotes all primitive $n$-th roots of unity.

Best Answer

Related Solutions

Generating function for distribution of 12 balls into 10 cells

Average number of strings with edit distance at most 4

Bigger $k$

Related Question