Solved – number of nodes in an unpruned decision tree

cartcombinatoricsmachine learning

What is the number of nodes in an unpruned decision tree that is trained using n samples and that grows until there is only one sample in each leaf?

I would like to know if there is a formula to compute it or at least some way to define a lower bound.

If each node splits the number of samples in half than the length of the decision tree is $log_2(n)$ and the number of nodes is $\sum_{i=1}^{log_2(n)} 2^i$. This formula is not more valid if the node does not split in half the number of samples. Is this a lower bound?

Best Answer

Supposing that you are in the case when all the terminal nodes has a single instance and you have only binary splits.

If we note with $n$ the number of instances in your training data set you will have exactly $n$ terminal nodes, one for each instance. Since you have a binary tree, and all nodes comes from a single root, that, it should be that each two terminal nodes has the same parent. Without loosing generality, we do not consider the case when there is an odd number of leaves (so we compute a maximal bound). This means that the nodes with are direct parents of terminal nodes are counted with $n/2$. Repeat the idea until you arrive at a single node which is root.

In order to explain to you in few words hot that is computed, see the following arrangement:

x (1 time)

x x (2 times)

x x x x (4 times)

x x x x x x x x (8 times)

....

x x x x x x... (n=times)

Note with $c_i$ the number of elements from the row $i$. You have then the following beautiful pattern:

$1 + c_1 = c_2$ or $1+1=2$

$1 + c_1 + c_2 = c_3$ or $1+1+2=4$

$1 + c_1 + c_2 + c_3 = c_4$ or $1+1+2+4=8$

And so on. So if you extend that for the whole tree, it looks like:

$1+ c_1+c_2+c_3+..+n=2n$

As a consequence the number of nodes is $2n-1$. This is a maximum, and holds only if $n$ is of form $2^k$. If $n$ does not have that form, you can find the smallest number greater than $n$ having the form $2^k$ and compute than the upper bound.

Related Question