[Math] Average Hamming distance between strings after some number of random substitutions in a population of initially identical elements

co.combinatoricspr.probability

Let's say I have a set $S$, $(s_1, …, s_i, …, s_P) \in S$, of $P$ identical strings over a $k$-letter alphabet, each of length $|s_i| = L$. With uniform random probability across all strings in $S$ (and all string positions in any $s_i$), I randomly substitute one character for another. And I do so $N$ times.

For $N >> 1$, all $s_i$ will approximate random sequences. But what is the average Hamming distance between any two strings as a function of $N$?

Best Answer

At each move, I assume you choose one of the character positions in one of the strings (with equal probabilities for all), and replace the character in that position by a randomly chosen character (with equal probabilities for all - note that this allows the possibility that the new character is the same as the old one). Let $X(n)$ be the event that after $n$ moves, the $i$'th character in string $j_1$ is the same as the $i$'th character in string $j_2$. Now if in move $n$ the position chosen was anything other than the $i$'th character in string $j_1$ or $j_2$, $X(n) = X(n-1)$, while if it was either of those, $X(n)$ has probability $1/k$. Thus ${\rm P}(X(n)) = (1 - \frac{2}{PL}) {\rm P}(X(n-1)) + \frac{2}{PLk}$ with ${\rm P}(X(0)) = 1$. The solution of this recurrence is ${\rm P}(X(n)) = \frac{1}{k} + \frac{k-1}{k} \left( 1-\frac{2}{PL} \right)^n$. The expected Hamming distance between strings $j_1$ and $j_2$ after $n$ moves is $L (1 - {\rm P}(X(n)))$.

(added in response to unknown(yahoo)'s further question): if the new character must be different from the existing one at that position, the recurrence becomes ${\rm P}(X(n)) = (1 - \frac{2}{PL}) {\rm P}(X(n-1)) + \frac{2}{PL} \frac{1-P(X(n-1))}{k-1}$, and the solution is ${\rm P}(X(n)) = \frac{1}{k} + \frac{k-1}{k} \left(1 - \frac{2k}{PL(k-1)}\right)^n$. Again the expected Hamming distance after $n$ moves is $L (1 - {\rm P}(X(n)))$.

Related Question