[Math] The expected minimum Hamming distance within a set of randomly selected binary strings

co.combinatoricspr.probability

If I randomly sample with replacement $P$ times from a set of all possible binary strings of length $L$, what is a good lowerbound on the expected minimum Hamming distance between any two of my $P$ strings? Can we generalize this for larger ternary/etc. string alphabets?

Best Answer

The expected value of the distance is the sum of the probabilities that the distance is greater than $d$ for $d=0,1,2, ...$. Unfortunately, it's hard to determine the exact value of this probability. Whether this is greater than $0$ determines whether a code of that size exists.

There are some easy bounds on the probabilities which translate to bounds on the expected minimum Hamming distance. Take any ordering of the points, and check whether each point is consistent with the previous points. As an upper bound, the balls of radius $\lfloor d/2 \rfloor$ about the previous points must be disjoint, and the current point must not be contained in these balls. As a lower bound, each previous choice rules out at most a ball of radius $d$ new points.

Let $B(L,d,q) = \sum_{i=0}^d {L \choose i}(q-1)^i$ be the size of the ball of radius $d$ with an alphabet of size $q$. Let $Pr(L,d,q,P)$ be the probability that $P$ random points are at distances strictly greater than $d$ from each other.

$$ \prod_{i=1}^{P-1} \max(0,1-\frac{iB(L,d,q)}{q^L}) \le Pr(L,d,q,P) \le \prod_{i=1}^{P-1} \max(0,1-\frac{iB(L,\lfloor d/2 \rfloor, q)}{q^L}).$$

The upper bound on the expected minimum Hamming distance from summing the probabilities is precisely twice the lower bound, so these bounds get you within a factor of $2$. Actually, that upper bound on the probabilities is so weak that a better bound in many situations is $Pr(L,d,q,P) \le (1-B(L,d,q)/q^L)^{P-1}$. You get that by saying that the second and subsequent points have to avoid the ball of radius $d$ about the first point.

For example, suppose you take $q=2, L=20, P=10.$ The lower bound on the expected minimum Hamming distance is $5.10246$. If you use the second upper bound when it is better, the upper bound on the expected minimum distance is $6.71209.$