Solved – Distribution of the Levenshtein distance between two random strings

approximationdistancedistributions

The Levenshtein or edit distance between two strings is the minimum number of edits (adding a letter, removing a letter or changing a letter) required to transform one into the other.

Assume that we have two strings, each of size $n$ and that consist of letters drawn uniformly IID from an alphabet of $k$ letters. What is the (possibly approximate) distribution of their Levenshtein distance?

This can be seen as an extreme value problem since the Levenshtein distance is the minimum number of mismatches over all the possible ways to align the two sequences. For an alignment chosen randomly among those with $m$ insertions and $m$ deletions, the number of mismatches has a Binomial distribution with parameters $n-m$ and $1-1/k$, so it is tempting to compute the distribution of the minimum of all those Binomial variables (as in this question), but in this case, the Binomial variables are not IID.

A solution to this problem would be useful in biology (where $k=4$ in the case of DNA sequences) because it would give the statistical properties of the alignment between unrelated sequences.

Best Answer

I guess it is not a direct answer, but you can try to simulate the scenario and check the empirical distribution to have a rough idea (R code below).

enter image description here

enter image description here

So it seems that for strings longer than ~ 100 the distribution is symmetric and quite narrow around .53 times the length of the string.

# dependencies

library(ggplot2); theme_set(theme_classic())
library(parallel)
library(RecordLinkage)

# settings

alphabet <- c("A", "C", "G", "T")
Nsim <- 1e3
read_lengths <- seq(60, 500, 20)


# function to create a random string of length "n" using letters of the alphabet "alph"

random_read <- function(n, alph=alphabet) paste(sample(alph, size=n, replace=T), collapse="")

# simulate

res <- mclapply(read_lengths,
                function(N) replicate(Nsim, levenshteinDist(random_read(N), random_read(N))),
                mc.cores=6)

# arrange results as data.frame

res_df <- data.frame(dist=unlist(res),
                     length=rep(read_lengths, sapply(res, length)))

# plot densities

ggplot(res_df,
       aes(x=dist / length, col=length, group=length)) +
  geom_density() +
  ggtitle("Distribution of Levenshtein distance / length")


ggplot(res_df,
       aes(x=length, y=dist / length, col=length)) +
  geom_violin(aes(group=length)) +
  geom_smooth(col="black", lwd=1) +
  ggtitle("Distribution of Levenshtein distance / length")
Related Question