Minimum Value Distribution Among Non-Independent Random Variables

cumulative-distribution-functions, integration, probability

Suppose we have several independent and identically distributed random variables, each uniform on the interval $[0, 1]$, denoted $x_1, x_2, x_3, \dots, x_m$ and $y_1, y_2, \dots, y_n$. Let
$$
D_{i,j} = \begin{cases}
x_i-y_j, & x_i\ge y_j \\\\
1+x_i-y_j, & x_i<y_j
\end{cases}
$$

Intuitively, these points are distributed on a circle of circumference 1, and distances can only be measured in the clockwise direction.

So
$$
\begin{aligned}
F_{D_{ij}}(d) &= P(D_{ij} \leq d) \\\\
&= P(D \leq d | x_i \geq y_j)P(x_i \geq y_j) + P(D \leq d | x_i < y_j)P(x_i < y_j) \\\\
& = P(x_i-y_j \leq d , x_i \geq y_j) + P(1+x_i-y_j \leq d,x_i < y_j) \\\\
& = \iint\limits_{0\leq x_i-y_j \leq d} f(x_i,y_j)\mathrm{d}x_i \mathrm{d}y_j +\iint\limits_{x_i-y_j \leq d-1,x_i-y_j<0} f(x_i,y_j)\mathrm{d}x_i \mathrm{d}y_j \\\\
& = \frac{1}{2}-\frac{(1-d)^2}{2}+\frac{d^2}{2} \\\\
& = d,\ \ \ 0\leq d\leq 1 \\\\
\end{aligned}
$$

Therefore, $D_{ij}$ also follows a uniform distribution on the interval $[0, 1]$.
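This can be checked quickly with a small simulation (a sketch using the circle interpretation; note that $(x - y) \bmod 1$ is exactly the piecewise definition above):

```python
import random

random.seed(0)
N = 200_000
# D = (x - y) mod 1 equals x - y when x >= y, and 1 + x - y when x < y
samples = [(random.random() - random.random()) % 1.0 for _ in range(N)]

mean = sum(samples) / N                      # should be near 1/2
cdf_03 = sum(s <= 0.3 for s in samples) / N  # F(0.3) should be near 0.3
print(mean, cdf_03)
```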
However, unfortunately, the $D_{ij}$ are not independent across all $i$ and $j$. For example, suppose $D_{11} = x_1 - y_1$ and $D_{12} = x_1 - y_2$. Then $E(D_{11}D_{12}) = E((x_1 - y_1)(x_1 - y_2)) = E(x_1^2) - E(x_1y_1) - E(x_1y_2) + E(y_1y_2) = \frac{1}{12} \neq 0 = E(D_{11})E(D_{12})$, so they are correlated.

So it is difficult to determine the distribution function of the random variable $Y = \min(D_{11}, D_{12}, \dots, D_{1n}, D_{21}, D_{22}, \dots, D_{2n}, \dots, D_{m1}, D_{m2}, \dots, D_{mn})$, because its terms are not independent.

Therefore, I have to use the definition of the distribution function to find its expression.

I attempted a small-scale example with only three variables, namely $x_1$, $y_1$, and $y_2$.

So let $Y = \min(D_{11},D_{12})$.
$$
\begin{aligned}
F_Y(y) & = P(\min(D_{11},D_{12})\leq y) \\\\
& = 1-P(\min(D_{11},D_{12})>y) \\\\
& = 1 - P(D_{11}>y,D_{12}>y)
\end{aligned}
$$

So I had to determine the joint probability distribution of $(D_{11}, D_{12})$.
$$
\begin{aligned}
F_{D_{11},D_{12}}(d_{11},d_{12}) &= P(D_{11} \leq d_{11},D_{12} \leq d_{12}) \\\\
&= P(D_{11} \leq d_{11},D_{12} \leq d_{12},x_1\ge y_1,x_1\ge y_2) \\\\
&\ \ \ + P(D_{11} \leq d_{11},D_{12} \leq d_{12},x_1\ge y_1,x_1< y_2) \\\\
&\ \ \ + P(D_{11} \leq d_{11},D_{12} \leq d_{12},x_1< y_1,x_1\ge y_2)\\\\
&\ \ \ + P(D_{11} \leq d_{11},D_{12} \leq d_{12},x_1< y_1,x_1< y_2) \\\\
&= P(x_1-y_1 \leq d_{11},x_1-y_2 \leq d_{12},x_1\ge y_1,x_1\ge y_2) \\\\
&\ \ \ + P(x_1-y_1 \leq d_{11},1+x_1-y_2 \leq d_{12},x_1\ge y_1,x_1< y_2) \\\\
&\ \ \ + P(1+x_1-y_1 \leq d_{11},x_1-y_2 \leq d_{12},x_1< y_1,x_1\ge y_2)\\\\
&\ \ \ + P(1+x_1-y_1 \leq d_{11},1+x_1-y_2 \leq d_{12},x_1< y_1,x_1< y_2) \\\\
& = \iiint\limits_{0\leq x_1-y_1 \leq d_{11},0\leq x_1-y_2 \leq d_{12}} f(x_1,y_1,y_2)\mathrm{d}x_1 \mathrm{d}y_1 \mathrm{d}y_2 \\\\
&\ \ \ +\iiint\limits_{0\leq x_1-y_1 \leq d_{11},x_1-y_2 \leq d_{12}-1,x_1-y_2<0} f(x_1,y_1,y_2)\mathrm{d}x_1 \mathrm{d}y_1 \mathrm{d}y_2 \\\\
&\ \ \ +\iiint\limits_{x_1-y_1 \leq d_{11}-1,x_1-y_1<0,0\leq x_1-y_2 \leq d_{12}} f(x_1,y_1,y_2)\mathrm{d}x_1 \mathrm{d}y_1 \mathrm{d}y_2 \\\\
&\ \ \ +\iiint\limits_{x_1-y_1 \leq d_{11}-1,x_1-y_1<0,x_1-y_2 \leq d_{12}-1,x_1-y_2<0} f(x_1,y_1,y_2)\mathrm{d}x_1 \mathrm{d}y_1 \mathrm{d}y_2 \\\\
\end{aligned}
$$

Since $D_{ij}$ follows a uniform distribution between 0 and 1, it is evident that $d_{ij}$ must lie within the interval [0, 1].

If it were just a double integral, I could draw a graph and either read off the result or split the double integral into two parts based on the graph, because the bounds of $y_1$ as a function of $x_1$ differ between the two parts, as shown in the diagram below.

*(diagram of the integration region)*

But for me, solving the above four triple integrals is a bit challenging, so I am seeking help from everyone.
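In the meantime, a Monte Carlo estimate of the joint CDF can at least provide numerical values against which a candidate closed form could be checked (a sketch; the evaluation point $d_{11}=0.3$, $d_{12}=0.4$ is arbitrary):

```python
import random

def joint_cdf_mc(d11, d12, trials=200_000, rng=None):
    """Monte Carlo estimate of F_{D11,D12}(d11, d12)."""
    rng = rng or random.Random(1)
    hits = 0
    for _ in range(trials):
        x1, y1, y2 = rng.random(), rng.random(), rng.random()
        D11 = (x1 - y1) % 1.0  # the piecewise definition, written via mod 1
        D12 = (x1 - y2) % 1.0
        if D11 <= d11 and D12 <= d12:
            hits += 1
    return hits / trials

print(joint_cdf_mc(0.3, 0.4))
```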

For the function $F(D_{11}, D_{12}, \dots, D_{1n})$, since it involves $n+1$ variables ($x_1, y_1, y_2, \dots, y_n$) and each $D_{ij}$ is a piecewise function, it should in principle be split into $2^n$ $(n+1)$-fold integrals. Similarly, the function $F(D_{11}, \dots, D_{1n}, D_{21}, \dots, D_{2n}, \dots, D_{m1}, \dots, D_{mn})$ should be split into $2^{mn}$ $(m+n)$-fold integrals. However, I hope to discover a pattern in the process of solving for $F(D_{11}, D_{12})$.

For the above question, a kind and very clever user provided a fantastic answer: for $Y=\min(D_{11}, \dots, D_{1n}, D_{21}, \dots, D_{2n}, \dots, D_{m1}, \dots, D_{mn})$,
$$
P(Y\ge d) = \tbinom{m+n-1}{n}^{-1}\sum_{k=1}^{\min(m,n)}\tbinom{m}{k}\tbinom{n-1}{k-1}\max(0,1-kd)^{m+n-1}
$$

and
$$
E(Y) = \frac{1}{mn}(1-\tbinom{m+n}{n}^{-1})
$$

Numerical experiments have verified that this answer is correct.
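For example, a verification along these lines (a minimal sketch; the trial count and the choice $m=3$, $n=4$ are mine):

```python
import math
import random

def expected_min(m, n):
    # closed form E(Y) = (1/(mn)) * (1 - C(m+n, n)^{-1})
    return (1 - 1 / math.comb(m + n, n)) / (m * n)

def simulate_min(m, n, trials=20_000, seed=1):
    # average of Y = min over all mn clockwise distances (x_i - y_j) mod 1
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        ys = [rng.random() for _ in range(n)]
        total += min((x - y) % 1.0 for x in xs for y in ys)
    return total / trials

m, n = 3, 4
print(expected_min(m, n), simulate_min(m, n))
```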

Furthermore, I would like to investigate the expectation of the sum of distances obtained by repeatedly performing greedy matching on an $m \times n$ distance matrix. Let's assume that $m \ge n$ for the purpose of this discussion.

Similarly, I would like to start exploring possible patterns by considering a small-scale example. Therefore, let's assume $n = 2$.

So for the first greedy match ("GM"), $E(GM_1) = \frac{1}{2m}\left(1-\tbinom{m+2}{2}^{-1}\right)=\frac{m+3}{2(m+1)(m+2)}$. For an $m \times 2$ distance matrix, after finding the minimum value, we match the corresponding $x$ and $y$ and then remove that row and column, because each point can be matched only once.

This operation leaves us with an $(m-1) \times 1$ matrix (in general, an $(m-1) \times (n-1)$ matrix). We then continue greedily searching for the minimum value among the remaining elements of this residual matrix, which becomes $GM_2$.

Without additional constraints, the distribution function of $GM_2$ would have the same form as that of $GM_1$, because $GM_2$ can be viewed as the minimum distance between $(m-1)$ $x$-points and one $y$-point. However, the matrix corresponding to $GM_2$ is clearly a submatrix of the one corresponding to $GM_1$.

So, in reality, what I want to solve is $E(GM_2|GM_2\ge GM_1)$.

I think I can start by solving for $E(GM_2|GM_2\ge GM_1, GM_1=g)$, which is equivalent to $E(GM_2|GM_2\ge g)$.

Without any conditions, $P(GM_2\ge d)=\tbinom{m-1}{1}^{-1}\tbinom{m-1}{1}(1-d)^{m-1}=(1-d)^{m-1}$

so $f_{GM_2}(d)=(m-1)(1-d)^{m-2}$

$$
\begin{aligned}
E(GM_2|GM_2\ge g) & =\frac{\int_{g}^1 x \mathrm{d} F_{GM_2}(x) }{P(GM_2\ge g)} \\\\
& = \frac{m^{-1}(1-g)^{m-1}(gm-g+1)}{(1-g)^{m-1}}\\\\
& = \frac{gm-g+1}{m}
\end{aligned}
$$
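This conditional expectation can be sanity-checked by simulation, since $P(GM_2 \ge d) = (1-d)^{m-1}$ means $GM_2$ has the same distribution as the minimum of $m-1$ i.i.d. uniforms (a sketch; $m=5$ and $g=0.2$ are arbitrary choices of mine):

```python
import random

def cond_mean_sim(m, g, trials=200_000, seed=3):
    # GM_2 has P(GM_2 >= d) = (1-d)^(m-1), i.e. the law of the
    # minimum of (m-1) i.i.d. Uniform(0,1) variables
    rng = random.Random(seed)
    kept, total = 0, 0.0
    for _ in range(trials):
        v = min(rng.random() for _ in range(m - 1))
        if v >= g:  # keep only samples satisfying the condition GM_2 >= g
            kept += 1
            total += v
    return total / kept

m, g = 5, 0.2
formula = (g * m - g + 1) / m  # the derived (gm - g + 1)/m
print(formula, cond_mean_sim(m, g))
```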

because
$$
P(GM_1\ge d) = \tbinom{m+n-1}{n}^{-1}\sum_{k=1}^{\min(m,n)}\tbinom{m}{k}\tbinom{n-1}{k-1}\max(0,1-kd)^{m+n-1}
$$

so
$$
f_{GM_1}(d)=-\frac{\mathrm{d}}{\mathrm{d}d}P(GM_1\ge d) = \tbinom{m+n-1}{n}^{-1}\sum_{k=1}^{\min(m,n)}\tbinom{m}{k}\tbinom{n-1}{k-1}k(m+n-1)\max(0,1-kd)^{m+n-2}
$$

let $n=2$,
$$
f_{GM_1}(d)=2\left[(1-d)^m+(m-1)\max(0,1-2d)^m\right]
$$

so
$$
\begin{aligned}
E(GM_2|GM_2\ge GM_1) &= \int_{0}^{1}\frac{gm-g+1}{m}f_{GM_1}(g)\mathrm{d}g\\\\
& = \int_{0}^{1}\frac{gm-g+1}{m}2\left[(1-g)^m+(m-1)\max(0,1-2g)^m\right]\mathrm{d}g\\\\
& =2\left[\int_{0}^{1}\frac{gm-g+1}{m}(1-g)^m\mathrm{d}g + \int_{0}^{1/2}\frac{gm-g+1}{m}(m-1)(1-2g)^m\mathrm{d}g\right]\\\\
& = 2\left\{\frac{1}{m}\left[\frac{m-1}{(m+1)(m+2)}+\frac{1}{m+1}\right] + \frac{m-1}{m}\left[\frac{m-1}{4(m+1)(m+2)}+\frac{1}{2(m+1)}\right]\right\} \\\\
& = \frac{3m^2+8m+1}{2m(m+1)(m+2)}
\end{aligned}
$$
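For reference, here is a minimal sketch of the greedy-matching simulation used for comparison (the function name and parameters are my own; $m=5$, $n=2$ is an arbitrary choice):

```python
import random

def greedy_sums(m, n, trials=20_000, seed=4):
    """Average of GM_1, ..., GM_n from greedy matching on the m x n mod-1 distance matrix."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        ys = [rng.random() for _ in range(n)]
        rows, cols = set(range(m)), set(range(n))
        for step in range(n):
            # pick the smallest remaining distance, then delete its row and column
            i, j, best = min(((i, j, (xs[i] - ys[j]) % 1.0)
                              for i in rows for j in cols),
                             key=lambda t: t[2])
            sums[step] += best
            rows.discard(i)
            cols.discard(j)
    return [s / trials for s in sums]

print(greedy_sums(5, 2))  # [average GM_1, average GM_2]
```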

Unfortunately, the obtained result does not match the results from experimental simulations, indicating that there may be an error. Could someone please point out where the mistake lies? Thank you very much!

Best Answer

Your variable $D_{ij}$ is the distance from $x_i$ to $y_j$ in, say, the clockwise direction if you bend the unit interval into a circle. So you want the distribution of the minimum $Y$ of the clockwise distances from $m$ points $x_i$ to $n$ points $y_j$.

This can be determined by distinguishing the cases in which the $x_i$ and the $y_j$ each form $k$ contiguous blocks, with $1\le k\le\min(m,n)$. In such an arrangement, for $Y$ to be $\ge d$ there must be a clockwise gap of length at least $d$ from each $x$-block to the following $y$-block, whereas there's no constraint on the clockwise gap from a $y$-block to an $x$-block.

Pick one of the points at which to cut the circle into an interval. Then successively remove each of the $k$ gaps of length $d$ by subtracting $d$ from all points that come after it. This moves all the points into an interval of length $1-kd$, and this transformation is bijective: Each configuration with all points in this interval is transformed into a configuration with $k$ gaps of length $d$ by adding such gaps clockwise of the $x$-blocks. The volume of the set of configurations in which the remaining $m+n-1$ points lie within $1-kd$ clockwise of the chosen point is $(1-kd)^{m+n-1}$, so this is also the probability for there to be a gap of length $d$ clockwise of each $x$-block.

To count the cyclic permutations with $k$ contiguous blocks of each type, first arrange the $m$ points $x_i$ cyclically in one of $(m-1)!$ ways, then choose $k$ of the $m$ bins between them in one of $\binom mk$ ways, then distribute $n$ balls into these $k$ non-empty bins in one of $\binom{n-1}{k-1}$ ways, and then assign the $n$ points $y_j$ to these $n$ balls in one of $n!$ ways. In total there are $(m+n-1)!$ cyclic permutations, so the probability to form $k$ groups is

$$ \frac{(m-1)!\binom mk\binom{n-1}{k-1}n!}{(m+n-1)!}=\frac{\binom mk\binom{n-1}{k-1}}{\binom{m+n-1}n}\;. $$

A more symmetric way to derive the same result is to distribute $m$ balls into $k$ non-empty groups in $\binom{m-1}{k-1}$ ways and $n$ balls into $k$ non-empty groups in $\binom{n-1}{k-1}$ ways and divide by $k$ for cyclicity, whereas the total number of cyclic arrangements of $m$ and $n$ balls of two types is $\binom{m+n}m$ divided by $m+n$ for cyclicity, which also yields

$$ \frac{\frac1k\binom{m-1}{k-1}\binom{n-1}{k-1}}{\frac1{m+n}\binom{m+n}m}=\frac{\binom mk\binom{n-1}{k-1}}{\binom{m+n-1}n}\;. $$

Putting everything together, we have

$$ \mathsf P(Y\ge d)={\binom{m+n-1}n}^{-1}\sum_{k=1}^{\min(m,n)}\binom mk\binom{n-1}{k-1}\max(0,1-kd)^{m+n-1}\;. $$

For example, for $m=3$ and $n=4$ this is

$$ \mathsf P(Y\ge d)=\frac1{15}\left(3\max(0,1-d)^6+9\max(0,1-2d)^6+3\max(0,1-3d)^6\right)\;. $$
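A sketch of how this example can be checked numerically (the evaluation point $d=0.1$ and the trial count are arbitrary choices):

```python
import math
import random

def p_min_ge(m, n, d):
    # P(Y >= d) = C(m+n-1, n)^{-1} * sum_k C(m,k) C(n-1,k-1) max(0, 1-kd)^{m+n-1}
    s = sum(math.comb(m, k) * math.comb(n - 1, k - 1)
            * max(0.0, 1 - k * d) ** (m + n - 1)
            for k in range(1, min(m, n) + 1))
    return s / math.comb(m + n - 1, n)

rng = random.Random(2)
m, n, d, trials = 3, 4, 0.1, 20_000
hits = 0
for _ in range(trials):
    xs = [rng.random() for _ in range(m)]
    ys = [rng.random() for _ in range(n)]
    if min((x - y) % 1.0 for x in xs for y in ys) >= d:
        hits += 1
print(p_min_ge(m, n, d), hits / trials)
```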

Here’s a plot that compares this cumulative distribution function with the corresponding empirical distribution function from a simulation with $1000$ trials, with good agreement:

*(plot: theoretical distribution function vs. empirical distribution function)*

It turns out that this yields a rather simple expression for the expected value of $Y$:

\begin{eqnarray*} \mathsf E[Y] &=& \int_0^1\mathsf P(Y\ge d)\,\mathrm dd \\ &=& \int_0^1{\binom{m+n-1}n}^{-1}\sum_{k=1}^{\min(m,n)}\binom mk\binom{n-1}{k-1}\max(0,1-kd)^{m+n-1}\mathrm dd \\ &=& {\binom{m+n-1}n}^{-1}\sum_{k=1}^{\min(m,n)}\binom mk\binom{n-1}{k-1}\int_0^\frac1k(1-kd)^{m+n-1}\mathrm dd \\ &=& {\binom{m+n-1}n}^{-1}\sum_{k=1}^{\min(m,n)}\binom mk\binom{n-1}{k-1}\frac1{k(m+n)} \\ &=& \frac1{mn}{\binom{m+n}n}^{-1}\sum_{k=1}^{\min(m,n)}\binom mk\binom nk\;. \end{eqnarray*}

By Vandermonde’s identity we have

$$\sum_{k=0}^{\min(m,n)}\binom mk\binom nk=\binom{m+n}n\;,$$

so adding and subtracting the $k=0$ term yields

$$\mathsf E[Y]=\frac1{mn}\left(1-{\binom{m+n}n}^{-1}\right)\;.$$
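The agreement between the $k$-sum and the closed form can be verified numerically for small $m,n$ (a quick sketch):

```python
import math

def e_min_sum(m, n):
    # E[Y] from the k-sum, before applying Vandermonde's identity
    s = sum(math.comb(m, k) * math.comb(n, k) for k in range(1, min(m, n) + 1))
    return s / (m * n * math.comb(m + n, n))

def e_min_closed(m, n):
    # closed form after Vandermonde's identity
    return (1 - 1 / math.comb(m + n, n)) / (m * n)

assert all(abs(e_min_sum(m, n) - e_min_closed(m, n)) < 1e-15
           for m in range(1, 8) for n in range(1, 8))
print(e_min_closed(3, 4))
```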

For $n=1$ (and analogously for $m=1$) this reduces to the standard result

\begin{eqnarray*} \mathsf E[Y] &=& \frac1m\left(1-{\binom{m+1}1}^{-1}\right) \\ &=& \frac1{m+1}\;. \end{eqnarray*}
