[Math] Average Hamming distance between strings after some number of random substitutions in a population of initially identical elements

co.combinatoricspr.probability

Let's say I have a set $S$, $(s_1, …, s_i, …, s_P) \in S$, of $P$ identical strings over a $k$-letter alphabet, each of length $|s_i| = L$. With uniform random probability across all strings in $S$ (and all string positions in any $s_i$), I randomly substitute one character for another. And I do so $N$ times.

For $N >> 1$, all $s_i$ will approximate random sequences. But what is the average Hamming distance between any two strings as a function of $N$?

Best Answer

At each move, I assume you choose one of the character positions in one of the strings (with equal probabilities for all), and replace the character in that position by a randomly chosen character (with equal probabilities for all - note that this allows the possibility that the new character is the same as the old one). Let $X(n)$ be the event that after $n$ moves, the $i$'th character in string $j_1$ is the same as the $i$'th character in string $j_2$. Now if in move $n$ the position chosen was anything other than the $i$'th character in string $j_1$ or $j_2$, $X(n) = X(n-1)$, while if it was either of those, $X(n)$ has probability $1/k$. Thus ${\rm P}(X(n)) = (1 - \frac{2}{PL}) {\rm P}(X(n-1)) + \frac{2}{PLk}$ with ${\rm P}(X(0)) = 1$. The solution of this recurrence is ${\rm P}(X(n)) = \frac{1}{k} + \frac{k-1}{k} \left( 1-\frac{2}{PL} \right)^n$. The expected Hamming distance between strings $j_1$ and $j_2$ after $n$ moves is $L (1 - {\rm P}(X(n)))$.

(added in response to unknown(yahoo)'s further question): if the new character must be different from the existing one at that position, the recurrence becomes ${\rm P}(X(n)) = (1 - \frac{2}{PL}) {\rm P}(X(n-1)) + \frac{2}{PL} \frac{1-P(X(n-1))}{k-1}$, and the solution is ${\rm P}(X(n)) = \frac{1}{k} + \frac{k-1}{k} \left(1 - \frac{2k}{PL(k-1)}\right)^n$. Again the expected Hamming distance after $n$ moves is $L (1 - {\rm P}(X(n)))$.

Related Solutions

[Math] The expected minimum Hamming distance within a set of randomly selected binary strings

The expected value of the distance is the sum of the probabilities that the distance is greater than $d$ for $d=0,1,2, ...$. Unfortunately, it's hard to determine the exact value of this probability. Whether this is greater than $0$ determines whether a code of that size exists.

There are some easy bounds on the probabilities which translate to bounds on the expected minimum Hamming distance. Take any ordering of the points, and check whether each point is consistent with the previous points. As an upper bound, the balls of radius $\lfloor d/2 \rfloor$ about the previous points must be disjoint, and the current point must not be contained in these balls. As a lower bound, each previous choice rules out at most a ball of radius $d$ new points.

Let $B(L,d,q) = \sum_{i=0}^d {L \choose i}(q-1)^i$ be the size of the ball of radius $d$ with an alphabet of size $q$. Let $Pr(L,d,q,P)$ be the probability that $P$ random points are at distances strictly greater than $d$ from each other.

$$ \prod_{i=1}^{P-1} \max(0,1-\frac{iB(L,d,q)}{q^L}) \le Pr(L,d,q,P) \le \prod_{i=1}^{P-1} \max(0,1-\frac{iB(L,\lfloor d/2 \rfloor, q)}{q^L}).$$

The upper bound on the expected minimum Hamming distance from summing the probabilities is precisely twice the lower bound, so these bounds get you within a factor of $2$. Actually, that upper bound on the probabilities is so weak that a better bound in many situations is $Pr(L,d,q,P) \le (1-B(L,d,q)/q^L)^{P-1}$. You get that by saying that the second and subsequent points have to avoid the ball of radius $d$ about the first point.

For example, suppose you take $q=2, L=20, P=10.$ The lower bound on the expected minimum Hamming distance is $5.10246$. If you use the second upper bound when it is better, the upper bound on the expected minimum distance is $6.71209.$

[Math] Estimating the distribution of minimal hamming distances within a set of strings

There are two natural meanings for the distribution of minimum distances. One is the probability distribution for the minimum distance to one point. Another is the multiset of minimum distances to each point. They are related. This answer addresses the first sense of distribution. For a partial answer to the second, see Minimum Hamming Distance Distribution in a Random Subset of Binary Vectors+.

The cumulative distribution of the minimum of $s$ IID random variables satisfies

$$1-\operatorname{CDF}_s(d) = (1-\operatorname{CDF}_1(d))^s$$

since each side is the probability that all $s$ variables are greater than $d$.

If you assume that the symbols are uniformly random and independent, then the distance of a given string to a fixed string follows a binomial distribution. The probability that a particular string will be within a distance $d$ is the cumulative distribution function for the binomial distribution,

$$\operatorname{CDF}_1(d) = \sum_{k=0}^d {n\choose k}p^k (1-p)^{n-k}$$

where $n$ is the length of the string, and $p$ is $3/4$ or $1-\frac{1}{|\textrm{alphabet}|}$.

There isn't a closed form expression for this, but you can estimate it in many ways. Some techniques are better at estimating the cumulative distribution function near the median, and some are better for the tails. See Lower bound for sum of binomial coefficients? and the techniques in Sum of 'the first k' binomial coefficients for fixed n . For many particular values, you can use an exact computation of the distribution rather than estimating it. For example, for $s=1000, n=8,p=3/4,$ here is some code that computes the cumulative distribution function values:

cdf1[d_, n_, p_] := Sum[Binomial[n, k] p^k (1 - p)^(n - k), {k, 0, d}]
cdfs[d_, n_, p_, s_] := 1 - (1 - cdf1[d, n, p])^s
Table[N[cdfs[d, 8, 3/4, 1000]], {d, 0, 8}]

$0.01514,0.3172,0.9855,1.0000,1.0000,...$.

Best Answer

Related Solutions

[Math] The expected minimum Hamming distance within a set of randomly selected binary strings

[Math] Estimating the distribution of minimal hamming distances within a set of strings

Related Question