[Math] Finding a node in a full binary tree: expected number of comparisons

expectationtrees

Consider a full binary search tree of height $k$ (the root is on level $1$ and the leaves on level $k$). By full I mean that all leaves are on level $k$ and level $k$ has exactly $2^{k-1}$ leaves. In other words there are $2^0 + \cdots + 2^{k-1} = 2^k -1$ nodes in the tree.

Search tree refers to the fact that given any non-leaf $x$, then the value of $x$ is bigger than the left child but smaller than the right child (wlog assume all values are distinct).

Suppose we want to find a leaf with a value $z$. Computing this is easy, for each comparison we move down a level in the tree so in total $k$ comparisons are needed to determine the location of $z$ (given it exists in the tree).

What if we wanted to search for $z$ in all nodes, and not just in the leaves. How many comparisons would one expect to make?

My computations:
1 comparison is needed to determine if $z$ is in level 1 (the root). 2 comparisons for level 2, etc. Given $n$ nodes in total, the number of nodes in level $t$ is $2^{t-1}$. So the probability that $z$ is in level $t$ is $2^{t-1}/n$. So the expected number of comparisons are
$1*(1/n) + 2*(2/n) + \cdots + k*(2^{k-1}/n) = \frac{1}{n}(1*2^0 + 2*2^1 + \cdots k*2^{k-1}).$
Let $f(x) = x + x^2 + \cdots + x^k$. Then clearly the sum in the parentheses is equal to $f'(2)$. $f(x) = \frac{x^{k+1}-x}{x-1}$ and so $f'(2) = 2^k(k-1)-1$. The expected number of comparisons on $n$ elements/nodes are therefore
$\frac{2^k(k-1)-1}{2^k-1}$ which seems to be extremely close to $\frac{2^k(k-1)-1}{2^k-1} \approx \frac{2^k(k-1)}{2^k}= k-1 < k =$ number of comparisons when searching for leaves only. So in other words, searching for leaves just adds one more comparison in expectation. Another way of saying it is that the difference between worst case and expected case is just one single comparison. Is this correct?

Best Answer

I didn't check every detail but the answer looks about right. Intuitively, most of the nodes are near the leaves, because the levels near the root have very few nodes, for instance the first half of the levels have only $2^{\frac{1}{2}k+1}-1$ nodes but the whole tree has $2^k-1$ nodes, which is roughly the square!

This is also why data should always be stored at the leaves unless you are very stingy for space, because there is a lot more that can be done (range queries) and more cleanly (immutable trees) when the internal nodes aren't forced into a dual role of both data and structure.