[Math] Entropy of edit distance

co.combinatoricsentropypr.probability

The edit or Levenshtein distance between two strings is the minimum number of single character insertions, deletions and substitutions to transform one string into another. If we take random binary strings $A$ and $B$ of the same length $n$, is it known what the entropy of the edit distance between $A$ and $B$ is?

Update. I would be happy with a proof that the entropy is asymptotically $c \log n$ for some (unknown) $c>0$ as suggested by Anthony Quas (or even $(\Theta(\log{n})$ which potentially saves having to prove convergence).

Best Answer

This is not a complete answer and may not tell you anything you don't already know, but it's too long for a comment.

You can get an estimate on the number of words at an edit distance of $k$ from a given word of length $n$ by using the arguments in the proof of Lemma 2.6 in this paper. (The lemma is on page 10 and its proof is on page 28.)

The idea is that if a word $w$ is at an edit distance of $k$ from a word $v$, then one can get from $v$ to $w$ via the following steps:

  1. insert $k$ copies of the symbol 'e' (for 'edit') into the word $v$;
  2. one by one, go through the symbols 'e' and either change the symbol immediately before it, delete the symbol immediately before it, or insert a symbol immediately before it, and then delete the 'e'.

Step 1 gives a word of length $n+k$ with $k$ occurrences of the symbol 'e', so there are ${n+k\choose k}$ possibilities after this step. Then step 2 gives at most $3^k$ possible words for each of those possibilities, so for a fixed word $v$ of length $n$ one obtains the bound $$ \#\{w \mid \text{edit distance from $v$ to $w$ is $k$}\} \leq 3^k {n+k \choose k}. $$ For small $k$ this is a reasonable bound; in particular when $k\ll n$ this is not much larger than ${n\choose k}$, the number of words a Hamming distance of $k$ away. The problem is that for larger values of $k$ the procedure described above succumbs to a lot of overcounting, so it's not clear what the actual bound should be.

Related Question