[Math] Estimating the distribution of minimal hamming distances within a set of strings

co.combinatoricspr.probability

Is their an efficient mathematical way to estimate the distribution of minimal hamming distances for a set of random strings of length 8 over a 4-letter alphabet? E.g. given a set of 100-10,000 strings of length 8 over a 4-letter alphabet, determine for each sequence the minimal distance from the rest of the set and construct the overall distribution.

The motivation is bioinformatic e.g. comparing the observed vs. expected distributions to learn about the point-mutation rate. The observed distribution is calculated explicitly from the data but it would be nice to estimate the expected distribution mathematically. I assume that any deviation from a random distribution is caused by point mutations.

This is very related to a previous post which spoke about the expected minimal hamming distance but not the distribution itself.

Best Answer

There are two natural meanings for the distribution of minimum distances. One is the probability distribution for the minimum distance to one point. Another is the multiset of minimum distances to each point. They are related. This answer addresses the first sense of distribution. For a partial answer to the second, see Minimum Hamming Distance Distribution in a Random Subset of Binary Vectors+.

The cumulative distribution of the minimum of $s$ IID random variables satisfies

$$1-\operatorname{CDF}_s(d) = (1-\operatorname{CDF}_1(d))^s$$

since each side is the probability that all $s$ variables are greater than $d$.

If you assume that the symbols are uniformly random and independent, then the distance of a given string to a fixed string follows a binomial distribution. The probability that a particular string will be within a distance $d$ is the cumulative distribution function for the binomial distribution,

$$\operatorname{CDF}_1(d) = \sum_{k=0}^d {n\choose k}p^k (1-p)^{n-k}$$

where $n$ is the length of the string, and $p$ is $3/4$ or $1-\frac{1}{|\textrm{alphabet}|}$.

There isn't a closed form expression for this, but you can estimate it in many ways. Some techniques are better at estimating the cumulative distribution function near the median, and some are better for the tails. See Lower bound for sum of binomial coefficients? and the techniques in Sum of 'the first k' binomial coefficients for fixed n . For many particular values, you can use an exact computation of the distribution rather than estimating it. For example, for $s=1000, n=8,p=3/4,$ here is some code that computes the cumulative distribution function values:

cdf1[d_, n_, p_] := Sum[Binomial[n, k] p^k (1 - p)^(n - k), {k, 0, d}]
cdfs[d_, n_, p_, s_] := 1 - (1 - cdf1[d, n, p])^s
Table[N[cdfs[d, 8, 3/4, 1000]], {d, 0, 8}]

$0.01514,0.3172,0.9855,1.0000,1.0000,...$.

Related Question