Expected average depth in random binary tree constructed top-to-bottom

combinatoricsrandom-graphs

Suppose I have $n$ elements that I want to put into proper binary trees (that is, each node in the tree must have either 2 children or be a terminal node – no 1-node-only are allowed), with a tree structure that is produced top-to-bottom by partitioning the remaining number of elements $m$ uniformly at random between $[1, m-1]$ to assign to one branch, and the remainder to the other branch.

Example possible structures with 3 nodes:

    .   |   .
   / \  |  / \
  .   o | o   .
 / \    |    / \
o   o   |   o   o

Example possible structures with 4 nodes:

    .      |   .       |       .    |
   / \     |  / \      |      / \   |
  .   .    | o   .     |     .   o  |
 /\   /\   |    / \    |    / \     |
o  o o  o  |   o   .   |   .   o    |
           |      / \  |  /\        |
           |     o   o | o  o       |
-------------------------------------
    .      |      .    |            |
   / \     |     / \   |            |
  o   .    |    .   o  |            |
     / \   |   / \     |            |
    .   o  |  o   .    |            |
   / \     |     / \   |            |
  o   o    |    o   o  |            |
-------------------------------------

What would be the expected value of the average depth from the tree root to a terminal node if the tree structure is determined at random from top to bottom?

E.g. in the 3-node trees, the depths are $\{1, 2, 2\}$ in both cases, for an average depth of $\frac{5}{3}$ in both and the same expected value (both are equally likely and there are no more possibilities), while in the 4-node trees they are $\{2, 2, 2, 2\}$ (average $2$) in the first and $\{1, 2, 3, 3\}$ (average $2.25$) in all the others, giving an expected value of $E[d_4] = 2 \frac{1}{3} + 2.25 (1 – \frac{1}{3}) = 2.16667$ (since there's a probability $p = \frac{1}{3}$ that the first structure would be chosen – that's the probability of the first split putting two elements at each side and from there there's only one possible way to split them).

I see there are other answers for the variation in which each tree structure is equally likely, such as this one: https://cs.stackexchange.com/questions/99238/deriving-the-average-depth-for-a-randomly-generated-binary-search-tree?rq=1 – but they are not exactly the same
scenario and I'm wondering what would be the solution here.

Best Answer

Let $t_m$ denote the expected total depth: this is only off by a factor of $m$ from the expected average depth, but has a nicer recursive description.

Specifically, $$ t_m = \frac1{m-1}\sum_{i=1}^{m-1}(t_i + t_{m-i} + m). $$ Each of $m-1$ initial splits is equally likely, so we average over all of them. If the split is $i$ on the left and $m-i$ on the right, then the expected total for the left subtree is $t_i$, the expected total for the right subtree is $t_{m-i}$, except that we have to increase each depth by $1$ (increasing the total by $m$) since they are all subtrees.

We can simplify this recurrence a bit to $$ t_m = m + \frac2{m-1} \sum_{i=1}^{m-1} t_i, $$ but the summation is still awkward. Fortunately, there's a standard trick to fixing that. Since we have \begin{align} (m-1) t_m &= m(m-1) + 2 \sum_{i=1}^{m-1}t_i, \\ (m-2)t_{m-1} &= (m-1)(m-2) + 2 \sum_{i=1}^{m-2} t_i, \end{align} by taking the difference of the two equations we cancel most of the sum, and get $$ (m-1) t_m - (m-2)t_{m-1} = 2(m-1) + 2t_{m-1}, $$ or $(m-1)t_m = 2(m-1) + m t_{m-1}$. Now each term is given only in terms of the previous.

To finish solving the recurrence, divide through by $m(m-1)$, which gives us $$ \frac{t_m}{m} = \frac{t_{m-1}}{m-1} + \frac 2m. $$ With $s_m = \frac{t_m}{m}$, this is just $s_m = s_{m-1} + \frac 2m$. The substitution $s_m = \frac{t_m}{m}$ is not only convenient, but also gives us the original problem back: $s_m$ is exactly the expected average depth.

From $s_m = s_{m-1} + \frac2m$ with the initial condition $s_2 = 1$ (since there is only one possible tree with $2$ leaves, which has average depth $1$) we get $$ s_m = 1 + \frac23 + \frac24 + \dots + \frac2m = \sum_{i=2}^m \frac2i. $$ This is as closed a form as we get, though we can rewrite it as $s_m = 2(H_m-1)$ in terms of the $m^{\text{th}}$ harmonic number.

For $m = 2, 3, \dots$ the sequence of expected average depths begins $$ 1,\frac{5}{3},\frac{13}{6},\frac{77}{30},\frac{29}{10},\frac{223}{70},\frac{481}{140},\frac{4609}{1260},\frac{4861}{1260}, \dots $$