Binary trees constructed from the bottom up

random-graphstrees

I'm dealing with a set of random binary trees which I can't find referenced anywhere in literature. Computer scientists seem to prefer "random search trees" which is a different ensemble than mine (that's partially why I'm asking here rather than the CS stackexchange: I don't think this ensemble is relevant to CS at all, though I may be wrong).

I would be very grateful if anyone could point me in the right direction toward a better understanding of these lovely graphs.

"Fusion binary trees"

(Not to be confused with fusion trees.)
Consider a set of $ n $ labeled nodes $ \{(1), \dotsc, (n)\} $, initially disjoint from each other. Two such nodes are considered "adjacent" if their labels differ by 1 modulo $ n $.

Adjacent nodes can be joined into a tree by introducing a new node having those nodes as left- and right-children. For instance, I can introduce a new node having $ (3) $ and $ (4) $ as children, and call it $ (3,4) $. Now $ (3) $ and $ (4) $ cannot be fused anymore, whereas $ (3,4) $ is considered adjacent to $ (2) $ and $ (5) $ and can be fused with them.

Iteratively I can thus fuse all the nodes together into a single binary tree.

The random ensemble

I can encode a sequence of fusions as a permutation of $ (1, \dotsc, n) $, where $ k $ represents the fusion of the unique unfused node containing label $ k $ with the node immediately to its right (with (1) being considered "to the right of" (n)). This encoding is well-defined though not one-to-one (fusing far-away nodes one after another is independent of the order of fusions).

The random ensemble I'm considering is then defined by taking uniformly random permutations of $ (1, \dotsc, n) $. Note that this is a biased distribution on the space of all binary trees with $ n $ leaves.

My question

I'm trying to compute the probability distribution of the graph distance between two adjacent leaves of this tree. I'm content with the asymptotic form for large $ n $. Note that this pdf is the same for all pairs of adjacent leaves due to the periodic boundary conditions (in alternative, use open boundary conditions and focus on the central leaves).

I expect from numerical results that this pdf have an average of $ O(\log n) $, but I've been unsuccessful at finding analytical results.

If you have any idea on how to tackle this problem, or any references where this ensemble is considered in literature, thank you very much in advance.

Best Answer

As a starting point, the expected value of the distance between two adjacent nodes is $4(1 - \frac1n)$.

If you traverse the shortest path from node $1$ to node $2$, then from node $2$ to node $3$, and so on until you take the shortest path from node $n$ to node $1$, then you end up using each edge of the tree exactly twice. (An edge created when we fused the subtree containing nodes $\{i,i+1, \dots, j-1,j\}$ to another subtree is used when we go from $i-1$ to $i$, and again when we go from $j$ to $j+1$.)

There are $2n-1$ total nodes in the tree, and therefore there are $2n-2$ edges. So the total length of these shortest paths is $4n-4$. By symmetry (in the periodic-boundary version), each shortest path has the same distribution, so each shortest path has length $\frac{4n-4}{n} = 4(1 - \frac1n)$ in expectation.


Here's a not-very nice but exact description of the distribution.

Another useful view of "fusion binary trees" with the open boundary condition is as "fission binary trees" (an equally made-up term). Start with a tree that has $1$ node and no edges. Then, repeatedly choose a uniformly random leaf, and add a left child and a right child to that leaf. After $n-1$ steps, number all the leaves $1, 2, \dots, n$ from left to right.

This model of random binary trees is equivalent, except that all the choices are made in reverse. We can prove this by induction. In your model, if we choose a random pair $(k,k+1)$ to fuse into one node, the rest of the tree is constructed from the vertices $\{1, 2, \dots, k-1, (k,k+1), k+2, \dots, n\}$, so it's a random $(n-1)$-leaf fusion binary tree. By the induction hypothesis, this is equivalent to a random $(n-1)$-leaf fission binary tree. Additionally, each adjacent pair $(k,k+1)$ was equally likely to be chosen to fuse first, so every leaf of the $(n-1)$-leaf tree is equally likely to be the fused one - and choosing which leaf that is is equivalent to fission of a random leaf of the $(n-1)$-leaf tree.

The fission model gives us a different numbering of adjacent pairs: instead of numbering them $\{1,2\}$, $\{2,3\}$, and so on through $\{n-1,n\}$, we can number them according to the step in the fission process when that pair was split up. That is, the step at which their closest common ancestor was chosen as the leaf to receive children. Each pair gets a number between $1$ and $n-1$.

This is useful, because we can talk about the distribution of the distance between the adjacent pair split up at step $k$. At each step after step $k$, there are $2$ possible leaves to choose which would contribute $1$ to that distance (starting from a baseline of $2$). So the distance between the adjacent pair has distribution $$ 2 + \mathrm{Bernoulli}(\tfrac2{k+1}) + \mathrm{Bernoulli}(\tfrac2{k+2}) + \dots + \mathrm{Bernoulli}(\tfrac2{n-1}) $$ where the Bernoulli terms are independent. Similarly, the distance between the leftmost and rightmost vertex has the distribution $$ 2 + \mathrm{Bernoulli}(\tfrac2{2}) + \mathrm{Bernoulli}(\tfrac2{3}) + \dots + \mathrm{Bernoulli}(\tfrac2{n-1}) $$ because (after the very first fission step) the distance between them increases by $1$ whenever the first or last vertex in the fission tree splits.

Now let's go back to the fusion binary tree with periodic boundary, since that has symmetry. We can obtain it from the fission binary tree model by picking a random $s$ between $1$ and $n$, and numbering the leaves at the bottom of the $n$-leaf fission tree as $s, s+1, \dots, n, 1, \dots, s-1$ from left to right.

Each adjacent pair has a $\frac1n$ chance to be the pair split up at step $k$ in the fission process, and a $\frac1n$ chance to end up being the pair $\{s-1,s\}$. So we get a mixture distribution of the $n$ possible Bernoulli sums given above. This mixture distribution describes the distribution of adjacent-pair distance exactly, though it's not a very nice answer.

In particular, the $\mathrm{Bernoulli}(\frac2t)$ term has a $\frac tn$ chance of being included for each adjacent pair, so we can think of this distribution as $2$ plus a sum of $n-2$ different $\mathrm{Bernoulli}(\frac2n)$ random variables. But this does not make the distribution $2 + \mathrm{Binomial}(n-2, \frac2n)$ because the $\mathrm{Bernoulli}(\frac2n)$'s are correlated: the $\mathrm{Bernoulli}(\frac2t)$ terms are not included independently.