Compute the expected minimum Hamming distance with 3 strings

probability

If we sample $3$ binary strings of length $n$, uniformly and independently, what is the expected minimum Hamming distance between the closest pair?

Numerically, it seems to be asymptotic to $n/2$ but it would be great to know if there is an exact formula.

Best Answer

I realized that my original answer was unnecessarily complicated. I’m preserving it below, but here’s a more efficient approach.

As below, without loss of generality fix the first string to be all zeros. Now consider the counts $a_{ij}$ of positions in which the second and third string have the values $i$ and $j$, respectively. In contrast to the variables in the original solution, these four variables are all on an equal footing. They are all close to $\frac n4$. The Hamming distances are

\begin{eqnarray} h_{12}&=&a_{10}+a_{11}\;,\\ h_{13}&=&a_{01}+a_{11}\;,\\ h_{23}&=&a_{01}+a_{10}\;, \end{eqnarray}

and $s=\frac12\sum_{ij}a_{ij}$ and $\Delta_{ab}=h_{ab}-s$ form an orthogonal basis of the space. The count of strings is

\begin{eqnarray} 2^{-2n}\binom n{a_{00},a_{01},a_{10},a_{11}} &=& 2^{-2n}\frac{n!}{\prod_{ij}a_{ij}!} \\ &\approx& 2^{-2n}\frac{\sqrt{2\pi n}}{\prod_{ij}\sqrt{2\pi a_{ij}}}\exp\left(n\log n-\sum_{ij}a_{ij}\log a_{ij}\right)\;. \end{eqnarray}

As below, we can take the square roots to be constant, so they yield a factor $2^4(2\pi n)^{-\frac32}$. With

\begin{eqnarray} 2a_{00}&=&s-\Delta_{12}-\Delta_{13}-\Delta_{23}\;,\\ 2a_{01}&=&s-\Delta_{12}+\Delta_{13}+\Delta_{23}\;,\\ 2a_{10}&=&s+\Delta_{12}-\Delta_{13}+\Delta_{23}\;,\\ 2a_{11}&=&s+\Delta_{12}+\Delta_{13}-\Delta_{23}\;,\\ \end{eqnarray}

we get

\begin{eqnarray} && 2^{-2n}2^4(2\pi n)^{-\frac32}\iiiint\prod_{ij}\mathrm da_{ij}\delta\left(\sum_{ij}a_{ij}-n\right)\min(h_{12},h_{13},h_{23})\exp\left(n\log n-\sum_{ij}a_{ij}\log a_{ij}\right) \\ &\approx& 2^4(2\pi n)^{-\frac32}\iiiint\mathrm d\Delta_{12}\mathrm d\Delta_{13}\mathrm d\Delta_{23}\mathrm ds\delta(2s-n)\left(\frac n2+\min(\Delta_{12},\Delta_{13},\Delta_{23})\right)\exp\left(-\frac1{2n}\right. \\ && \left.\vphantom{\frac1{2n}}\left((-\Delta_{12}-\Delta_{13}-\Delta_{23})^2+(-\Delta_{12}+\Delta_{13}+\Delta_{23})^2+(\Delta_{12}-\Delta_{13}+\Delta_{23})^2+(\Delta_{12}+\Delta_{13}-\Delta_{23})^2\right)\right) \\ &=& \frac n2+2^3(2\pi n)^{-\frac32}\iiint\mathrm d\Delta_{12}\mathrm d\Delta_{13}\mathrm d\Delta_{23} \min(\Delta_{12},\Delta_{13},\Delta_{23})\exp\left(-\frac2n\left(\Delta_{12}^2+\Delta_{13}^2+\Delta_{23}^2\right)\right) \\ &=& \frac n2-\frac34\sqrt{\frac n\pi}\;, \end{eqnarray}

where the last integral is evaluated as below. This treatment should lend itself more readily to generalization to higher $N$.

Original answer:

Without loss of generality fix the first string to be all zeros. The second one has probability $2^{-n}\binom nm$ to have $m$ ones, and thus to have Hamming distance $m$ from the first string. The third one has probability

$$ 2^{-n}\binom mk\binom{n-m}l $$

to have $k$ zeros where the second string has ones and $l$ ones where the second string has zeros, and thus to have Hamming distance $k+l$ from the second string and $m-k+l$ from the first string. Thus the mean minimum distance is

$$ 2^{-2n}\sum_{m=0}^n\sum_{k=0}^m\sum_{l=0}^{n-m}\binom nm\binom mk\binom{n-m}l\min\left(m,k+l,m-k+l\right)\\=2^{-2n}\sum_{m=0}^n\sum_{k=0}^m\sum_{l=0}^{n-m}\frac{n!}{k!(m-k)!l!(n-m-l)!}\min\left(m,k+l,m-k+l\right)\;.$$

For large $n$, all three distances will be close to $\frac n2$, so $m\approx\frac n2$ and $k\approx\frac n4$, $l\approx\frac n4$. We can approximate the factorials and replace the bounded sums by unbounded integrals to obtain

$$ 2^{-2n}\int_{-\infty}^\infty\mathrm dm\int_{-\infty}^\infty\mathrm dk\int_{-\infty}^\infty\mathrm dl\min\left(m,k+l,m-k+l\right)\frac{\sqrt{2\pi n}}{\sqrt{2\pi k}\sqrt{2\pi(m-k)}\sqrt{2\pi l}\sqrt{2\pi (n-m-l)}}\\\exp\left(n\log n-k\log k-(m-k)\log(m-k)-l\log l-(n-m-l)\log(n-m-l)\right)\;. $$

With $m=\left(\frac12+\mu\right)n$, $k=\left(\frac14+\kappa\right)n$ and $l=\left(\frac14+\lambda\right)n$ this is

$$ 2^{-2n}\left(\frac n{2\pi}\right)^\frac32\int_{-\infty}^\infty\mathrm d\mu\int_{-\infty}^\infty\mathrm d\kappa\int_{-\infty}^\infty\mathrm d\lambda \left(\frac12+\min\left(\mu,\kappa+\lambda,\mu-\kappa+\lambda\right)\right)n \\ \frac1{\sqrt{\frac14+\kappa}\sqrt{\frac14+\mu-\kappa}\sqrt{\frac14+\lambda}\sqrt{\frac14-\mu-\lambda}} \\ \exp\left(-n\left(\left(\frac14+\kappa\right)\log\left(\frac14+\kappa\right)+\left(\frac14+\mu-\kappa\right)\log\left(\frac14+\mu-\kappa\right)\right.\right. \\ \left.\left.+\left(\frac14+\lambda\right)\log\left(\frac14+\lambda\right)+\left(\frac14-\mu-\lambda\right)\log\left(\frac14-\mu-\lambda\right)\right)\right) \\ \approx \frac n2+n\cdot2^4\left(\frac n{2\pi}\right)^\frac32\int_{-\infty}^\infty\mathrm d\mu\int_{-\infty}^\infty\mathrm d\kappa\int_{-\infty}^\infty\mathrm d\lambda\min\left(\mu,\kappa+\lambda,\mu-\kappa+\lambda\right) \\ \exp\left(-2n\left(\kappa^2+(\mu-\kappa)^2+\lambda^2+(\mu+\lambda)^2\right)\right) $$

(where we can take the square roots in the denominator to be constant because their linear terms cancel). This is a Gaussian integral with covariance matrix

$$ 4n\pmatrix{2&-1&1\\-1&2&0\\1&0&2}\;, $$

which has eigenvalues $4n\left(2+\sqrt2\right)$, $4n\cdot2$ and $4n\left(2-\sqrt2\right)$ and corresponding orthonormal eigenvectors $\left(\frac1{\sqrt2},-\frac12,\frac12\right)$, $\left(0,\frac1{\sqrt2},\frac1{\sqrt2}\right)$ and $\left(-\frac1{\sqrt2},-\frac12,\frac12\right)$. We can check at this point that the integral without the minimum Hamming distance is $1$, so the approximations have preserved the normalization.

By symmetry, we can evaluate the part of the integral where the minimum is $\mu$ and multiply by $3$. With the transformation

$$ \pmatrix{\mu\\\kappa\\\lambda}=\pmatrix{ \frac1{\sqrt2}&0&-\frac1{\sqrt2}\\ -\frac12&\frac1{\sqrt2}&-\frac12\\ \frac12&\frac1{\sqrt2}&\frac12 } \operatorname{diag}\left(4n\left(2+\sqrt2\right),4n\cdot2,4n\left(2-\sqrt2\right)\right)^{-\frac12} \pmatrix{x\\y\\z} $$

that transforms the covariance matrix to the identity, the boundary planes $\mu\lt\kappa+\lambda$ and $\mu\lt\mu-\kappa+\lambda$, that is, $\kappa\lt\lambda$, become

$$\sqrt{2-\sqrt2}\cdot x-\sqrt{2+\sqrt2}\cdot z\lt2y$$

and

$$\sqrt{2-\sqrt2}\cdot x+\sqrt{2+\sqrt2}\cdot z\gt0\;,$$

respectively. The $\mu$ in the integrand becomes $\frac1{4\sqrt n}\left(\sqrt{2-\sqrt2}\cdot x-\sqrt{2+\sqrt2}\cdot z\right)$. It makes sense to rotate to

$$ \pmatrix{u\\v}=\frac12\pmatrix{ \sqrt{2-\sqrt2}&-\sqrt{2+\sqrt2} \\ \sqrt{2+\sqrt2}&\sqrt{2-\sqrt2} }\pmatrix{x\\z} $$

so that the bounds are $u\lt y$ and $u\lt v$, respectively, and the factor $\mu$ in the integrand is $\frac u{2\sqrt n}$. Note that the bounds are now manifestly symmetric and include the third of the solid angle in which $u$ is the least of $u,v,y$. We can now evaluate the integral in spherical coordinates $y=r\cos\theta$, $u=r\sin\theta\cos\phi$ and $v=r\sin\theta\sin\phi$:

\begin{eqnarray} && n\cdot(2\pi)^{-\frac32}\iiint\limits_{u\lt\min(v,y)}\frac u{2\sqrt n}\mathrm e^{-\frac12\left(u^2+v^2+y^2\right)}\mathrm du\,\mathrm dv\,\mathrm dy \\ &=& \frac{\sqrt n}2\cdot(2\pi)^{-\frac32}\int_0^\infty\mathrm e^{-\frac12r^2}r^3\mathrm dr\int_{\frac\pi4}^{\frac{5\pi}4}\int_0^{\operatorname{arccot}\cos\phi}\sin^2\theta\mathrm d\theta\cos\phi\mathrm d\phi \\ &=& \frac{\sqrt n}2\cdot(2\pi)^{-\frac32}\int_0^\infty\mathrm e^{-\frac12r^2}r^3\mathrm dr\int_{\frac\pi4}^{\frac{5\pi}4}\frac12\left(\operatorname{arccot}\cos\phi-\frac{\cos\phi}{1+\cos^2\phi}\right)\cos\phi\mathrm d\phi \\ &=& \frac{\sqrt n}2\cdot(2\pi)^{-\frac32}\int_{\frac\pi4}^{\frac{5\pi}4}\left(\operatorname{arccot}\cos\phi-\frac{\cos\phi}{1+\cos^2\phi}\right)\cos\phi\mathrm d\phi \\ &=& \frac{\sqrt n}2\cdot(2\pi)^{-\frac32}\left(\pi\left(1-\frac3{\sqrt2}\right)-\pi\left(1-\frac1{\sqrt2}\right)\right) \\ &=& -\frac14\sqrt\frac n\pi\;. \end{eqnarray}

We need to multiply this by $3$ and add it to the main term $\frac n2$ to obtain

$$ \boxed{\frac n2-\frac34\sqrt\frac n\pi}\;. $$

Here’s code that performs a simulation for $n=64$ to check the result. The simulation yields a mean minimum Hamming distance of $28.575$, compared to

$$ \frac{64}2-\frac34\sqrt\frac{64}\pi=32-\frac6{\sqrt\pi}\approx28.615 $$

from the asymptotic analysis.

Best Answer

Original answer:

Related Solutions

[Math] the average Levenshtein distance between two random binary strings of length $L$

Average number of strings with edit distance exactly 2

Related Question