Expected number of paths required to separate elements in a binary tree

combinatoricsgraph theory

Perhaps related to previous question: Expected average depth in random binary tree constructed top-to-bottom

Suppose I have $n$ elements that I want to put into proper binary trees (that is, each node in the tree must have either 2 children or be a terminal node), with a tree structure that is produced top-to-bottom by partitioning the remaining number of elements $m$ uniformly at random between $[1, \:m−1]$ to assign to one branch, and the rest to the other branch, until each branch contains only one element.

If I pick any two elements at random from the $n$ elements that are to be put into the binary tree, what is the expected number of splits/branching after which the two elements will be put into different branches?

For example: if $n = 2$ then there is only 1 possible way to split the elements, and they end up each in a different branch after it, so the expected number of splits until separation is $1$.

If $n = 3$, there's two possible tree structures that each have equal probability:

    .   |   .
   / \  |  / \
  .   o | o   .
 / \    |    / \
o   o   |   o   o

(but note that the trees are still constructed by splitting elements in each node uniformly at random, so from $n=4$ and higher, each possible tree will have different probability)

On the root node, if the elements are numerated, there's 12 possible ways to split them, each of which has the same probability (since this is decided at random):

1 | 2 3
1 2 | 3
1 | 3 2
1 3 | 2
2 | 1 3
2 1 | 3
2 | 3 1
2 3 | 1
3 | 1 2
3 1 | 2
3 | 2 1
3 2 | 1

If I pick two elements, say 1 and 2 (or any other pair), there's $8$ possible splits in which they end up being separated (out of 12), and if they don't end up separated in the first split, they will fall into a branch with only two nodes, in which they will get separated after an additional split, giving out:
$$
E[s_3] = \frac{8}{12} (1 + E[s_1]) + \frac{4}{12} (1 + E[s_2]) = \frac{8}{12} + 2 \frac{4}{12} = 1.3333
$$

For $n = 4$, there's $4! = 24$ permutations, each of which can be split in $3$ different ways ((o|ooo), (oo|oo), (ooo|o)), and from these $72$ equally possible splits, there are $8$ in which the elements 1 and 2 end up in the same branch with remainder size $m = 2$, and $24$ in which they end up in the same branch with remainder size $m = 3$, giving:
$$
E[s_4] = \frac{72 – 8 – 24}{72} (1 + E[s_1]) + \frac{8}{72} (1 + E[s_2]) + \frac{24}{72} (1 + E[s_3]) = 1.5555
$$

Likewise for $n = 5$:
$$
E[s_5] = \frac{480-24-72-144}{480}(1+E[s_1]) + \frac{24}{480}(1 + E[s_2]) + \frac{72}{480}(1 + E[s_3]) + \frac{144}{480}(1 + E[s_4]) = 1.7167
$$

Best Answer

A recurrence / not a closed form solution

Your specification generates a random binary tree of $n$ leaves, but due to your specific randomization method, not all trees are equally likely. Also, you only care about the subtree containing both nodes, and don't care about the rest of the tree at all. For these reasons, I find it helpful to think of this as chopping away part of an array, instead of generating a whole tree.

So you have an array of length $n$, and you chop it into two parts, a left sub-array of length $a$ and a right sub-array of length $b$, where $a+b=n$, and $a \sim Uniform(\{1, 2, \dots, n-1\})$.

You start with a random pair of elements in the original array, equally likely to be any of ${n \choose 2}$ pairs. Three things can happen to this pair:

  • Event $A$: Both elements are in the left sub-array. This happens with prob ${a \choose 2} / {n \choose 2}$. Also, conditioned on $A$ happening, the pair is equally likely to be any of the ${a \choose 2}$ pairs residing in the left sub-array.

  • Event $B$: Similar to above, but for the right sub-array.

  • Event $C$: One element is in the left sub-array and the other is in the right sub-array. This happens with prob $ab / {n \choose 2}$.

So, if $f(n)$ is the answer you seek, i.e. expected number of chops to separate the two nodes, we have:

$$ \begin{array}{} f(n) &= 1 + {1 \over n-1} \sum_{a=1}^{n-1} \Bigl( {{a \choose 2} \over {n \choose 2}}f(a) + {{b \choose 2} \over {n \choose 2}}f(b) \Bigr) \\ &= 1 + {1 \over n (n-1)^2} \sum_{a=1}^{n-1} \Bigl( a(a-1)f(a) + (n-a)(n-a-1)f(n-a) \Bigr) \\ &= 1 + {2 \over n (n-1)^2} \sum_{a=1}^{n-1} a(a-1)f(a) \end{array} $$

The first few values:

 n     f(n)
 = =============
 2 1.0
 3 1.33333333333
 4 1.55555555556
 5 1.71666666667
 6 1.84
 7 1.9380952381
 8 2.01836734694
 9 2.08551587302

A bit more algebraic manipulations lead to some simplified forms, but I am still not sure we can get a closed-form for $f(n)$.

Define $g(n) = n(n-1)f(n)$ and we have:

$$g(n) = n(n-1) + {2 \over n-1} (g(1) + \dots + g(n-1))$$

Define $s(n) = g(1) + \dots + g(n)$ and we have:

$$s(n) - s(n-1) = n(n-1) + {2 \over n-1} s(n-1)$$

$$s(n) = n(n-1) + {n+1 \over n-1} s(n-1)$$

From the last equation, and remembering $g(1)=s(1)=0$, we might be able to solve for $s(n)$, at least as a sum / product but perhaps better yet as a closed form, and we can "rewind" back to $f(n)$ from there.


UPDATE 2019/10/11: The rest consists of intuitive (inexact) arguments, for large $n$:

$$ \begin{array}{} &s(n) &= n(n-1) + {n+1 \over n-1} [ (n-1)(n-2) + {n \over n-2}s(n-2)] \\ & &= n(n-1) + (n+1)(n-2) + {n+1 \over n-1}{n \over n-2} [(n-2)(n-3) + \dots] \\ & &= n(n-1) + (n+1)(n-2) + {(n+1)n(n-3) \over n-1} + \dots \\ & &\approx n^3 + \dots \\ \implies &g(n) &= s(n) - s(n-1) \approx 3 n^2 \\ \implies &f(n) &\approx 3 \end{array} $$

Well, I wasn't expecting that! But I wrote code to calculate $f$ and it does seem true:

   n      f(n)
 =====  ========
     1  0.000000
     3  1.333333
    10  2.142681
    30  2.586898
   100  2.830813
   300  2.929329
  1000  2.974032
  3000  2.989885
 10000  2.996485
 30000  2.998682
100000  2.999556

I am not 100% sure that I am using the correct numeric data types (of sufficient precision) in my python code, so these may be slightly inaccurate, but the trend is unmistakably $f(\infty) \to 3$. Hmm, wonder if we can prove that...

The effect of one chop in the $n \to \infty$ case can be modeled as cutting the real interval $(0,1)$ at a uniformly random point $x$. The probability that a random pair is not divided by the cut at $x$, is $x^2 + (1-x)^2$. So averaged over all $x \in (0,1)$ we have the probability that a random pair is not divided by a random cut:

$$p = \int_0^1 (x^2 + (1-x)^2) dx = \int_0^1 (1 - 2x + 2x^2) dx = \Bigl[x-x^2 + \frac23 x^3 \Bigr]_0^1 = \frac23$$

Come to think of it, a much easier proof that $p=\frac23$ is to realize that the two points of the pair and the cut-point are all independent and $\sim Uniform(0,1)$, so the cut divides the pair iff it is the middle number, which happens with prob $\frac13$ (by symmetry).

Anyway, now $f(n) \to 3$ makes sense. Either the pair is divided, in which case we're done, or if not, then we gotta start over; and when $n\to \infty$ then starting over you still have $n \to \infty$ (very roughly speaking). So the expected number of chops is

$$1 + \frac23 + (\frac23)^2 + (\frac23)^3 + \dots = {1 \over 1 - \frac23} = 3$$

Disclaimer: None of this properly proves that $f(n) < 3 \,\,\forall n$, or $\lim_{n \to \infty} f(n) = 3$, but I would be very surprised if they weren't true.

Related Question