Instead of considering one string and ask about strings close to it,
it's easier to take a more symmetric approach and ask for the number
of pairs of strings of given length $n$ that have levenshtein distance at
most 4. Then the average number of strings of distance at most 4 is that number
divided by $2^n$. The counting will be done using finite automata.
To motivate the automaton, let's consider a nondeterministic algorithm
to compute the levenshtein distance.
I imagine a machine with two input tapes, initially at the first letter,
and a counter, initially set to 0. The machine compares the letters at
the position of the heads of the tapes, and does one of the following steps:
move both reading heads forward and, if the read letters were different,
increment the counter, or just move one head forward and increment the counter.
(A head cannot move forward when it has already passed the end of the word.)
When both heads are behind the words, the computation ends and the value
of the counter is the result. I claim that this result is an upper bound
to the levenshtein distance of the input words, and that there is a computation
that returns exactly the distance.
I'll formalize this description with pseudocode where I replace the tapes with
variables containing the remaining, unprocessed string. Note that I only
look at its first element (head
), and only replace it with a version
where this element is removed (tail
):
algorithm levdist(l,r):
# nondeterministically compute an upper bound for the levenshtein distance
# in a why where the real distance is among the possible results
c := 0;
while l and r are not both empty
do one of these operations:
- a:=head(l); b:=head(r); l:=tail(l); r:=tail(r); if a<>b then c:=c+1;
- l:=tail(l); c:=c+1;
- r:=tail(r); c:=c+1;
return c;
Let's now simulate this computation with a nondeterministic (infinite)
automaton over the alphabet of pairs of bits, such that the input
$(l_1,r_1),(l_2,r_2)\dots(l_n,r_n)$ corresponds to input
$(l_1l_2\dots l_n,r_1r_2\dots r_n)$ to the original machine.
Note that now the words are of the same length, a restriction we didn't
need before.
The first machine can compare letters $a_i$ and $b_j$ at different positions.
If e.g. $i<j$, the automaton has to read up to $(a_j,b_j)$ to know $b_j$, and
then still must remember $a_i$ (and the following bits) in its state.
So to be able to simulate the machine, the automaton will have states that
know the counter, which (if any) tape is behind, and what the unprocessed
bits on that tape were.
So we have states $N_c$ for $c\geq0$ for when the tapes are in sync,
and states $L_{c,w}$ and $R_{c,w}$ with nonempty words $w$ and $c\geq |w|$,
the length of $w$. To simplify the following, let's allow $L_{c,\epsilon}$
and $R_{c,\epsilon}$ as alternate names for $N_c$, where $\epsilon$ is
the empty word.
For input $(l,r)$, the states $N_c$ have transitions to itself if $l=r$,
otherwise to $N_{c+1}$, and to $L_{c+1,l}$ and $R_{c+1,r}$.
If $\omega$ is a bit and $w=\omega w'$, then for that same input
$L_{c,w}$ has transitions to $L_{c',w'l}$ with $c'=c$ if $\omega=r$, otherwise
$c'=c+1$, and to $L_{c+1,wl}$. It also has an $\epsilon$-transition to
$L_{c+1,w'}$. The transitions of the $R$ state are likewise, with the roles of
$l$ and $r$ swapped.
When the automaton starts with initial state $N_0$ and processes some input,
it will end in some state from which an $N_c$ state can be reached. This $c$
is an upper bound for the levenshtein distance of the input. Each computation
of the first machine corresponds to a computation of the automaton that ends
in the $N_c$ state with the $c$ that the machine computes.
Now we don't really want to compute the distance, we only want to know if
it is at most 4, which we'll generalize to $k$.
(And for real computations, we prefer finite automata.)
To achieve this, we can restrict the states to those that have $c\leq k$.
Now some of the $L_{c,w}$ and $R_{c,w}$ states can no longer reach an $N$ state.
When we remove them, we keep those with $1\leq c\leq |w|\leq k-|w|$.
When this automaton processes some input with initial state $N_0$
and doesn't get stuck, we know that it could reach a state $N_c$
(with $c\leq k$), so we can just declare all states as accepting states.
The following GAP code which uses the
automata
package (of which we will use more below) formalizes the construction
of this nondeterministic finite automaton:
LoadPackage("automata");
levnfa := function(k)
local kh, d, c, states, l, w, AddT, dir, hd, tl, c0, c1,
t00, t01, t10, t11, te, e00, e01, x01, e10, x10, e11, ee;
kh:=QuoInt(k,2);
d:=NewDictionary([],true);
# prepare states
for c in [0..k] do
AddDictionary(d,[0,c],c+1);
AddDictionary(d,[1,c],c+1);
od;
states:=k+1;
for l in [1..kh] do
for w in IteratorOfTuples([0,1],l) do
for c in [l..k-l] do
states:=states+1;
AddDictionary(d,Concatenation([0,c],w),states);
states:=states+1;
AddDictionary(d,Concatenation([1,c],w),states);
od;
od;
od;
AddT:=function(sl,arg)
# add state id of state described by arg to sl if it exists,
# arg is [dir, c, w, x] where x is an optional extra bit appended to w
local x,l,val;
if Length(arg)=4 then
x:=[arg[4]];
else
x:=[];
fi;
l:=Concatenation(arg{[1,2]},arg[3],x);
val := LookupDictionary(d,l);
if val<>fail then
Add(sl,val);
fi;
end;
t00:=[]; t01:=[]; t10:=[]; t11:=[]; te:=[];
for c in [0..k] do
e00:=[]; e01:=[]; e10:=[]; e11:=[];
AddT(e00,[0,c,[]]); AddT(e11,[0,c,[]]);
AddT(e01,[0,c+1,[]]); AddT(e10,[0,c+1,[]]);
AddT(e00,[0,c+1,[0]]); AddT(e00,[1,c+1,[0]]);
AddT(e01,[0,c+1,[0]]); AddT(e01,[1,c+1,[1]]);
AddT(e10,[0,c+1,[1]]); AddT(e10,[1,c+1,[0]]);
AddT(e11,[0,c+1,[1]]); AddT(e11,[1,c+1,[1]]);
Add(t00,e00); Add(t01,e01); Add(t10,e10); Add(t11,e11); Add(te,[]);
od;
for l in [1..kh] do
for w in IteratorOfTuples([0,1],l) do
for c in [l..k-l] do
for dir in [0,1] do
e00:=[]; x01:=[]; x10:=[]; e11:=[]; ee:=[];
hd:=w[1];
tl:=w{[2..Length(w)]};
if hd=0 then
c0:=0; c1:=1;
else
c0:=1; c1:=0;
fi;
AddT(e00,[dir,c+c0,tl,0]);
AddT(x01,[dir,c+c1,tl,0]);
AddT(x10,[dir,c+c0,tl,1]);
AddT(e11,[dir,c+c1,tl,1]);
AddT(e00,[dir,c+1,w,0]);
AddT(x01,[dir,c+1,w,0]);
AddT(x10,[dir,c+1,w,1]);
AddT(e11,[dir,c+1,w,1]);
AddT(ee, [dir,c+1,tl]);
if dir=0 then
e01:=x01; e10:=x10;
else
e01:=x10; e10:=x01;
fi;
Add(t00,e00); Add(t01,e01); Add(t10,e10); Add(t11,e11); Add(te,ee);
od;
od;
od;
od;
return Automaton( "epsilon", states, 5,
[t00, t01, t10, t11, te],
[1], [1..states] );
end;
The automaton that we get can be transformed into a minimal deterministic
one. From its transition function we can get a matrix $M$ that contains the
number of ways we can get from one state to another in one step. From the
matrix $M^n$ we can see how many ways there are to get from one state to
another in $n$ steps. The number of ways to get from the initial state to
a terminal one in $n$ steps equals the number of accepted words, i.e. the
number of pairs of words of length $n$ with levenshtein distance $n$,
because the automaton is deterministic.
The automaton has a sink state, that we may remove to very slightly simplify
the calculation, since it does not contribute to the accepted words.
Then all the remaining states are accepting.
Here is GAP code for these calculations:
nfa := levnfa(4); # for k=4
dfa := RemovedSinkStates(MinimalizedAut(nfa));
size := NumberStatesOfAutomaton(dfa);
mat := NullMat(size, size);;
for row in TransitionMatrixOfAutomaton(dfa) do
for i in [1..size] do
if row[i]<>0 then
mat[i][row[i]] := mat[i][row[i]]+1;
fi;
od;
od;
init := ListWithIdenticalEntries(size, 0);;
init[InitialStatesOfAutomaton(dfa)[1]] := 1;;
Assert(0, FinalStatesOfAutomaton(dfa)=[1..size]);
fin := ListWithIdenticalEntries(size, 1);;
Now we can compute the number of pairs of words of length $n$
with distance at most $k$ with init * mat^n * fin
, but if we want that
for a lot of $n$, and especially if $k$ and therefore the size of the matrix
gets bigger, it is more efficient to do something like this:
res:=[];; v:=init;;
for i in [1..150] do
v:=v*mat;
Add(res, v*fin);
od;
To find a closed formula, we need the eigenvalues of the matrix $M$.
From the GAP computation
gap> Set(Factors(CharacteristicPolynomial(Rationals,mat)));
[ x_1-2, x_1-1, x_1, x_1+1, x_1^2+1, x_1^2+x_1+1 ]
(GeneralizedEigenvalues(Rationals,mat);
computes the same but takes longer)
we see that the eigenvalues are
$2, 1, 0, -1, \pm i$ and the third roots of unity.
We can use the index of the nilpotent part of the Jordan-Chevalley
decomposition minus one as bound for the degree of the polynomials that
appear in the closed formula, and as bound for the values for which it works,
but for bigger $k$ that seems to be too much work and we can just guess some
number. Also, if we do the work to compute the decomposition, it should be possible to directly get the formula from it. Anyway, GAP allows us to compute:
gap> j:=JordanDecomposition(mat);;
gap> IsZero(j[2]^4);
false
gap> IsZero(j[2]^5);
true
Now let's find out how to express the function as linear combination
of exponential functions (for nonzero real eigenvalues) and sine and
cosine functions (for complex eigenvalues), multiplied with
powers of $n$. Let's use powers up to the fifth, even though we expect
only fourth powers. The sine function corresponding to the third root of
unity is multiplied with $\sqrt{3}$ to get rational results. We need to
compute 42 coefficients, to be sure we do the fitting with all the values of
$n$ between 10 and 100:
gap> e := x -> n -> x^n;;
gap> ci := n -> RealPart(E(4)^n);;
gap> si := n -> ImaginaryPart(E(4)^n);;
gap> c3 := n -> RealPart(E(3)^n);;
gap> s3 := n -> ImaginaryPart(E(3)^n) * Sqrt(3);;
gap> SolutionMat(TransposedMat(List([10..100],n->Concatenation(
> List([e(2),e(1),e(-1),ci,si,c3,s3], fn ->
> List([0..5], exp -> n^exp*fn(n)))))),
> res{[10..100]});
[ 168264301/345744, -4115011/16464, 19597/336, -161/24, 17/48, 0,
-309817/648, -7058/81, -856/81, 449/81, -515/324, 0, 1/72, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -50920/194481, 2378/27783, 22/3969, 0,
0, 0, -61456/583443, 2566/83349, -10/3969, 0, 0, 0 ]
We succeeded, and we see that the eigenvalues $\pm i$ do not contribute
to the result.
After division by $2^n$, we get the result
$$\mathbb{E}(g_4(X_n))=p_2(n)+2^{-n}\Biggl(p_1(n)+(-1)^n p_{-1}(n)+
\cos\left(\frac{2\pi n}{3}\right)p_c(n)+\sqrt{3}\sin\left(\frac{2\pi n}{3}\right)p_s(n)\Biggr) $$
where
\begin{align}
p_2(n) &= \frac{17}{48}n^4-\frac{161}{24}n^3+\frac{19597}{336}n^2-
\frac{4115011}{16464}n+\frac{168264301}{345744}, \\
p_1(n) &= -\frac{515}{324}n^4+\frac{449}{81}n^3-\frac{856}{81}n^2-
\frac{7058}{81}n-\frac{309817}{648}, \\
p_{-1}(n) &= \frac{1}{72}, \\
p_c(n) &= \frac{22}{3969}n^2+\frac{2378}{27783}n-\frac{50920}{194481}, \\
p_s(n) &= -\frac{10}{3969}n^2+\frac{2566}{83349}n-\frac{61456}{583443}.
\end{align}
By comparison we see that the formula works for $n\geq 5$.
Bigger $k$
When we do the same for $k=5$, we get no new eigenvalues, but now we get a contribution from the $\pm i$ eigenvalues. Amazingly, the polynomial corresponding to the eigenvalue 1 has degree 6. And the formula only works for $n\geq 8$, so it misses some of the nontrivial cases.
For even bigger $k$ we get more roots of unity as eigenvalues. The computations get more difficult: the size of the NFA grows faster than $2^{k/2}$, and the DFA is much bigger. The matrix $M$ has dimension $N\times N$ where $N$ is the size of the DFA. But it is a sparse matrix with less than $4N$ nonzero entries.
The following GAP function computes a sparse representation of the matrix directly from the transition matrix of the automaton and writes it to a file named sparse
:
save_sparse := function(tm)
local os, size, row, i, st, states;
size:=Length(tm[1]);
os:=OutputTextFile("sparse",false);
AppendTo(os, size, " ", size, " ", "M\n");
for i in [1..size] do
states := Filtered(List([1..4], n -> tm[n][i]), s -> s<>0);
for st in Set(states) do
AppendTo(os, i, " ", st, " ", Number(states, s-> s=st), "\n");
od;
od;
AppendTo(os, "0 0 0\n");
CloseStream(os);
end;
It can be used like this:
save_sparse(TransitionMatrixOfAutomaton(dfa));
I could find the eigenvalues for bigger $k$ by computing the minimal polynomial of the matrix using LinBox.
The following table gives the sizes of the automata and the new eigenvalues for $k\leq 10$:
k #NFA #DFA new eigenvalues
0 1 1 2
1 2 2 -
2 7 15 1,0,-1
3 12 38 -
4 25 265 E(3),E(4)
5 38 700 -
6 67 4389 E(5),E(6)
7 96 11856 -
8 157 64905 E(7),E(8)
9 218 175766 -
10 343 859265 E(9),E(10)
where E(n)
denotes all primitive $n$-th roots of unity.
Best Answer
Since you want the length to remain unchanged and $2$ to be the minimal edit distance, the only options are two substitutions in different places, or an insertion and a deletion. (It doesn't matter in which order we carry out the insertion and the deletion.) It's straightforward that there are $\binom n2=\frac{n(n-1)}2$ different results of two substitutions in different places, so the task is to count the strings produced by an insertion and a deletion that can't be produced by at most two substitutions.
Let's count the cases where the insertion is to the left of the deletion and then multiply by $2$. The combined effect of the insertion and the deletion is to shift all $k$ bits between them to the right while replacing the first one and removing the last one. This result can also be achieved by at most $k$ substitutions, so we need $k\gt2$. Inserting $x$ within a run of $x$s has the same effect as inserting $x$ at the end of the run. Thus we can count all insertions with different effects once by always inserting the bit complementary to the one to the right of the insertion. Similarly, a deletion within a run has the same effect as a deletion at the start of the run, so we should only count deletions that follow a change between $0$ and $1$.
That gives us an initial count of
$$ 2\cdot\frac12\sum_{k=3}^n(n+1-k)=\sum_{k=1}^{n-2}k=\frac{(n-1)(n-2)}2\;, $$
which together with $\frac{n(n-1)}2$ from the substitutions yields $(n-1)^2$. That's already of the order of the counts you computed, but a bit too high, so we're overcounting.
If there are no further changes in the $k$ shifted bits other than the one preceding the deletion, then only the bits next to the insertion and deletion change, and we can achieve that with $2$ substitutions, so we have to subtract
$$ \sum_{k=3}^n\left(\frac12\right)^{k-2}(n+1-k)=\sum_{k=1}^{n-2}\left(\frac12\right)^{n-k-1}k=n-3+2^{-(n-2)}\;. $$
Also, if the entire range of shifted bits consists of alternating zeros and ones, then swapping the insertion and the deletion yields the same effect, so in this case we were double-counting and need to subtract
$$ \sum_{k=3}^n\left(\frac12\right)^{k-1}(n+1-k)\;, $$
which is half the previous sum. Thus, the expected number of binary strings of length $n$ at edit distance exactly $2$ from a uniformly randomly selected binary string of length $n$ is
$$ (n-1)^2-\frac32\left(n-3+2^{-(n-2)}\right)=n^2-\frac72n+\frac{11}2-6\cdot2^{-n}\;, $$
in agreement with your computed results.