Solved – Distribution of the Levenshtein distance between two random strings

The Levenshtein or edit distance between two strings is the minimum number of edits (adding a letter, removing a letter or changing a letter) required to transform one into the other.

Assume that we have two strings, each of size $n$ and that consist of letters drawn uniformly IID from an alphabet of $k$ letters. What is the (possibly approximate) distribution of their Levenshtein distance?

This can be seen as an extreme value problem since the Levenshtein distance is the minimum number of mismatches over all the possible ways to align the two sequences. For an alignment chosen randomly among those with $m$ insertions and $m$ deletions, the number of mismatches has a Binomial distribution with parameters $n-m$ and $1-1/k$, so it is tempting to compute the distribution of the minimum of all those Binomial variables (as in this question), but in this case, the Binomial variables are not IID.

A solution to this problem would be useful in biology (where $k=4$ in the case of DNA sequences) because it would give the statistical properties of the alignment between unrelated sequences.

# dependencies library(ggplot2); theme_set(theme_classic()) library(parallel) library(RecordLinkage) # settings alphabet <- c("A", "C", "G", "T") Nsim <- 1e3 read_lengths <- seq(60, 500, 20) # function to create a random string of length "n" using letters of the alphabet "alph" random_read <- function(n, alph=alphabet) paste(sample(alph, size=n, replace=T), collapse="") # simulate res <- mclapply(read_lengths, function(N) replicate(Nsim, levenshteinDist(random_read(N), random_read(N))), mc.cores=6) # arrange results as data.frame res_df <- data.frame(dist=unlist(res), length=rep(read_lengths, sapply(res, length))) # plot densities ggplot(res_df, aes(x=dist / length, col=length, group=length)) + geom_density() + ggtitle("Distribution of Levenshtein distance / length") ggplot(res_df, aes(x=length, y=dist / length, col=length)) + geom_violin(aes(group=length)) + geom_smooth(col="black", lwd=1) + ggtitle("Distribution of Levenshtein distance / length")

Best Answer

I guess it is not a direct answer, but you can try to simulate the scenario and check the empirical distribution to have a rough idea (R code below).

enter image description here

So it seems that for strings longer than ~ 100 the distribution is symmetric and quite narrow around .53 times the length of the string.

Best Answer

Related Solutions

Solved – Distance between random variables

Solved – Lower Bound on the Total Variation Distance between two Binomials

Related Question