Random Forest – Why a Single Decision Tree Cannot Represent an Entire Random Forest

Tags: cart, random-forest

I was intrigued by the reply from @JohnRos to the post Making a single decision tree from a random forest.
They say "<…> a random forest prediction cannot be represented by a single tree. This is because they learn predictors in different hypothesis classes: a random forest learns predictors over the space of linear combinations of trees, which includes predictors that are not trees <…>".
I cannot understand why a linear combination of trees includes predictors that are not trees.
So my first question is, can somebody please clarify the statement above?
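For concreteness (my own paraphrase in standard notation, not anything from the linked post): for regression, the RF prediction is the average of the individual tree predictions,

$$\hat f_{\mathrm{RF}}(x) \;=\; \frac{1}{k}\sum_{b=1}^{k} T_b(x),$$

where each $T_b$ is piecewise constant on axis-parallel hyperrectangles. The question is whether such an average can always be written as a single tree.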

So let us say I grow a Random Forest. I obtain a function that maps an input vector to an outcome.
Why can this function not be replicated by a single tree?
Let me attempt a rational argument.
A Random Forest generates a piecewise-constant function from $\mathbb{R}^n$, where $n$ is the number of input features, to $\mathbb{R}$, a single output.
The domain is partitioned into regions whose boundaries are orthogonal to the coordinate axes.
Let me consider the case $n=2$; I fail to see why this could not be generalised.
The Random Forest will partition the domain into something like

[Image: domain partition — a grid of coloured rectangles]

where colors represent regions over which the output is constant.
I take it that the picture shows a domain whose cell sides are multiples of a fixed length, which will not be the case for an RF, but that does not matter for the argument (simply extend each vertical/horizontal edge across the whole domain and you will still end up with a grid).
Given this partition, I do not see why it could not be represented by a single tree.
Take the leftmost vertical split and use it at the root node: the domain is now split in two, the leftmost column versus all the other columns.
Keep partitioning the leftmost column with horizontal splits.
Now consider the remaining columns and repeat: split off the leftmost of them with a vertical split, then partition it with horizontal splits. And so on.

Now I have a tree that represents the RF function exactly.
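For concreteness, here is a minimal sketch of this column-by-column construction for the $n=2$ case (my own illustration, not part of the original post; the grid encoding and the function names are assumptions made for the example):

```python
# A minimal sketch of the column-by-column construction described above, for
# the 2D case. The grid is given by the vertical split positions `xs`, the
# horizontal split positions `ys`, and a matrix `cell_values[i][j]` holding the
# constant output of the cell in column i (left to right) and row j (bottom to
# top). All names are illustrative, not part of any library.

def build_column(ys, column_values):
    """Build a chain of horizontal splits covering one column."""
    if len(column_values) == 1:          # single cell left: a leaf
        return {"leaf": column_values[0]}
    return {
        "split": ("y", ys[0]),           # test: y <= ys[0] ?
        "left": {"leaf": column_values[0]},                # cell below the split
        "right": build_column(ys[1:], column_values[1:]),  # cells above it
    }

def build_tree(xs, ys, cell_values):
    """Peel off the leftmost column with a vertical split, recurse on the rest."""
    if len(cell_values) == 1:            # single column left
        return build_column(ys, cell_values[0])
    return {
        "split": ("x", xs[0]),           # test: x <= xs[0] ?
        "left": build_column(ys, cell_values[0]),          # leftmost column
        "right": build_tree(xs[1:], ys, cell_values[1:]),  # remaining columns
    }

def predict(tree, x, y):
    """Evaluate the reconstructed tree at a point (x, y)."""
    while "leaf" not in tree:
        axis, threshold = tree["split"]
        value = x if axis == "x" else y
        tree = tree["left"] if value <= threshold else tree["right"]
    return tree["leaf"]

# 3 x 2 grid: vertical cuts at x = 1, 2 and a horizontal cut at y = 1.
tree = build_tree(xs=[1.0, 2.0], ys=[1.0],
                  cell_values=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
print(predict(tree, 0.5, 0.5))   # leftmost column, bottom row -> 0.1
print(predict(tree, 2.5, 1.5))   # rightmost column, top row   -> 0.6
```

The reconstructed tree evaluates exactly the same piecewise-constant function as the grid it was built from, which is all the argument needs.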

The second question is then, where am I going wrong?

EDIT

To clarify my question, I would like to underline that the image I posted represents the decision boundary for a single tree; see for example this page, or slide 15 of this document (also slide 10 for 2D and 3D examples).

My question can be then rephrased as follows.

  1. Given a tree, a decision boundary is determined.
  2. Given a Random Forest, a decision boundary is also determined.
  3. Given the decision boundary of an RF, I argue that there is a procedure to reconstruct one possible tree compatible with said decision boundary.
  4. Hence, it is possible to represent an entire Random Forest with a single tree.

Where is the flaw in this line of reasoning?

Best Answer

I think you are right and @JohnRos is wrong. Here's the argument.

  1. A random forest (RF) can be represented by a single tree if and only if the decision regions of the RF (within which a constant value is predicted) can be written as finite unions of hyperrectangles parallel to the coordinate axes. (Note that this does not mean that such a tree can be obtained from one of the standard algorithms for growing trees from data! You can construct the tree by systematically partitioning the space so as to reproduce the decision rule of the RF.)

  2. But this is in fact the case. If the RF was computed from $k$ individual trees, every observation that falls into the same intersection of the $k$ trees' leaves ends up with the same constant prediction, and these intersections are themselves hyperrectangles parallel to the coordinate axes, because that is what the original trees deliver (a quick empirical check of this is sketched below).
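Here is a rough empirical check of point 2 (my own sketch, not part of the original answer; the data, forest size and seeds are arbitrary choices): collect every split threshold used anywhere in the forest, form the grid of cells those thresholds induce, and verify that the forest's prediction is constant within each cell.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

rf = RandomForestRegressor(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

# Union of split thresholds per feature, over all trees in the ensemble.
thresholds = [set(), set()]
for est in rf.estimators_:
    t = est.tree_
    for feat, thr in zip(t.feature, t.threshold):
        if feat >= 0:                      # internal node (leaves have feature == -2)
            thresholds[feat].add(thr)

# Cell edges: the thresholds plus the boundaries of the [0, 1]^2 domain.
edges = [np.sort([0.0, *thresholds[d], 1.0]) for d in range(2)]

# Within each cell of the grid, the forest prediction should be a single constant.
constant_everywhere = True
for x_lo, x_hi in zip(edges[0][:-1], edges[0][1:]):
    for y_lo, y_hi in zip(edges[1][:-1], edges[1][1:]):
        pts = np.column_stack([rng.uniform(x_lo, x_hi, 20),
                               rng.uniform(y_lo, y_hi, 20)])
        preds = rf.predict(pts)
        if not np.allclose(preds, preds[0]):
            constant_everywhere = False

print("RF prediction constant on every cell of the threshold grid:",
      constant_everywhere)
```

Each cell of this grid lies inside a single leaf of every tree, so the average of the trees is necessarily constant on it; the script only confirms this numerically.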

This doesn't imply that the resulting tree is useful or allows better interpretation, as it may be extremely big. It basically needs to emulate all of the trees involved, so trying to interpret it may be just as difficult as looking at the whole ensemble that makes up the RF, and nothing is won. And, as said before, it does not mean either that there is a tree-building algorithm that could generate such a tree directly from the data, other than, of course, "fit the RF first and then construct the tree from it".
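To get a feel for how big that emulating tree can be, here is a rough back-of-the-envelope sketch (again my own, with arbitrary parameter choices): the emulating tree needs at most one leaf per cell of the global threshold grid, and the cell count is the product over features of (number of distinct thresholds + 1).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n_features = 5
X = rng.uniform(0, 1, size=(1000, n_features))
y = X.sum(axis=1) + rng.normal(0, 0.1, size=1000)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Distinct split thresholds per feature, pooled over all trees in the forest.
thresholds_per_feature = [set() for _ in range(n_features)]
for est in rf.estimators_:
    t = est.tree_
    for feat, thr in zip(t.feature, t.threshold):
        if feat >= 0:                      # skip leaves (feature == -2)
            thresholds_per_feature[feat].add(thr)

# The threshold grid has prod_d (t_d + 1) cells; the emulating tree needs at
# most one leaf per cell (fewer if neighbouring cells share a prediction).
cells = np.prod([len(s) + 1.0 for s in thresholds_per_feature])
print("distinct thresholds per feature:",
      [len(s) for s in thresholds_per_feature])
print(f"upper bound on the number of leaves of the emulating tree: {cells:.2e}")
```

This is only an upper bound, since adjacent cells with equal predictions could be merged, but it illustrates why the single equivalent tree is typically far too large to interpret.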