I am looking for an algorithm to incrementally generate distinct full binary trees over $N$ unique leaves. That is, I want an algorithm that can generate the $n^{th}$ distinct tree without generating all the $n-1$ trees before. Pre-generating all trees is practically impossible above a certain N
.
A full binary tree with its $N$ leaves labeled is equivalent to a binary grouping of $N$ unique elements grouped into pairs. There are $C_{N-1}$ different full binary trees or groupings of N
leaves, where $C_n$ is the $n^{th}$ Catalan number. For $N = 4$, there are $C_3 = 5$ trees. These are, with internal nodes labeled $5..7$ and with the equivalent grouping:
For $N = 5$:
(1 (2 (3 (4 5))))
(1 (2 ((3 4) 5)))
(1 ((2 3) (4 5)))
(1 ((2 (3 4)) 5))
(1 (((2 3) 4) 5))
((1 2) (3 (4 5)))
((1 2) ((3 4) 5))
((1 (2 3)) (4 5))
((1 (2 (3 4))) 5)
((1 ((2 3) 4)) 5)
(((1 2) 3) (4 5))
(((1 2) (3 4)) 5)
(((1 (2 3)) 4) 5)
((((1 2) 3) 4) 5)
I see three ways to solve this problem (ultimately they are equivalent):
- There is a simple algorithm that can directly generate the next distinct (non-isomorphic) tree incrementally.
- There is a bijective encoding from tree $T_i$ to sequence $S_i$ such that generating $S_{i+1}$ (decoding into tree $T_{i+1}$) is easily doable.
- In the ideal case, there is a simple bijection of the $C_{N-1}$ trees to a continuous interval of $C_{N-1}$ natural numbers (preferentially $(1..C_{N-1})$) so that generating the $i^{th}$ tree is as easy as to decode it from the integer $i$.
There are many algorithms to encode a binary tree to a bijectively unique sequence (e.g. to a Prüfer sequence) but the issue is then how to generate the next sequence that can be decoded to the next tree without many failed sequences that do not encode a valid tree of the above description and do not encode a tree that was already visited.
Best Answer
The number of ways to fully parenthesize a string of $n$ letters, $C_{n-1}$, obeys the following recurrence: $$ C_{n-1} = \sum_{i=1}^{n-1}C_{i-1}C_{n-i-1} $$ To see this, consider the two "shallowest" groups in the parenthesization. Namely, ignoring the leftmost
(
and the rightmost)
, look at the parenthesis matching the leftmost(
. This will surround the first $i$ letters in the string, which can be further paranthesized in $C_{i-1}$ ways, while the latter $n-i$ letters can be parenthesized in $C_{n-i-1}$ ways. For example when $n=5$, the $*$ illustrates all of the break points:This recurrence gives you a quickly computable bijection from the first $C_{n-1}$ non-negative integers to binary trees. You are given an integer $k$ for which $0\le k\le C_{n-1}-1$. Compute the partial sums $$ \sum_{i=1}^{s-1} C_{i-1}C_{n-i-1} $$ to find the largest number $s\ge 1$ for which that partial sum is at most $k$. Then, insert parentheses into the list of numbers
(1 2 3 ... n)
as follows:If $s=1$, you can omit the parentheses around
(1)
, and similarly when $s=n-1$ around(n)
.Then, letting $$e=k - \Big(\sum_{i=1}^{s-1}C_{i-1}C_{n-i-1}\Big),$$ and letting \begin{align}k_1&=e\pmod {C_{s-1}}\\k_2&=\lfloor e/C_{s-1}\rfloor,\end{align} you will have $0\le k_1\le C_{s-1}-1$ and $0\le k_2\le C_{n-s-1}-1$, and you can recursively apply the bijections for $k_1$ to the list
(1 2 ... s)
and for $k_2$ to the list(s+1 s+2 ... n)
.Edit: There was a "bug" in my bijection which I just fixed. You can test that it works here.
Edit2: I just fixed another off-by-one error.