[Math] Expected edit distance

co.combinatoricscomputer sciencepr.probability

The edit or Levenshtein distance between two strings is the minimum number of single symbol insertions, deletions and substitutions to transform one string into another. For example $$\operatorname{E}(01010,00100)=2.$$

Let $E_n$ be a random variable giving the edit distance between two random binary strings of length $n$.

What bounds can be found for

$$\lim_{n \to \infty} \frac{\mathbb{E}(E_n)}{n} = c\;?$$

According to my non-extensive simulations, $c \approx 0.288$.

There is related work by Luecker on the expected length of the longest common subsequence which has lower and upper bounds of $0.788071n$ and $0.826280n$. This was a line of work originally started by Chvatal and Sankoff. However the longest common subsequence length and $n$ minus the edit distance need not be similar.

Best Answer

The only rigorous bound I am aware of is due to Gonzalo Navarro*

$$c\geq 1-{\rm e}/\sqrt{\sigma},$$

for an alphabet of $\sigma$ characters. Obviously, for the binary string ($\sigma=2$) this bound is ineffective. Navarro also mentions a large-$\sigma$ conjecture $c=1-1/\sqrt{\sigma}$, which for the binary string would give $c=0.2929$, quite close to your numerical finding.

*G. Navarro, A guided tour to approximate string matching (2001)