Solved – Non-perpendicular hyperplane decision trees

cart, classification, random-forest

The usual implementation of decision trees, and of estimators based on ensembles of decision trees, uses trees that threshold on a single predictor at each split. So they divide the feature space with axis-aligned (perpendicular) hyperplanes.

Is there any benefit to generalizing this slightly? At each split, the decision tree would use two features and threshold on whether $ax_1 + bx_2 \ge T$. In other words, the feature space would be divided by hyperplanes that are not necessarily perpendicular to the axes.
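
To make the distinction concrete, here is a minimal sketch (the function names and values are just illustrative) of the two kinds of split predicate:

```python
import numpy as np

def axis_aligned_split(X, feature, threshold):
    """Standard CART-style split: threshold a single feature."""
    return X[:, feature] >= threshold

def oblique_split(X, i, j, a, b, threshold):
    """Oblique split: threshold the linear combination a*x_i + b*x_j >= T,
    so the separating hyperplane need not be axis-aligned."""
    return a * X[:, i] + b * X[:, j] >= threshold

# Example: points routed by the diagonal boundary x_1 + x_2 >= 1.
X = np.array([[0.2, 0.3], [0.9, 0.4], [0.7, 0.8]])
print(oblique_split(X, i=0, j=1, a=1.0, b=1.0, threshold=1.0))  # [False  True  True]
```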

I think the main disadvantages would be:

  1. increased computation, and
  2. overfitting, since more degrees of freedom are used at each split

I guess that randomization, analogous to Extra-Trees, could help with both of these issues. The pair of features could be randomly selected instead of exhaustively trying every pair, and the threshold $T$ and coefficients $a$ and $b$ could be chosen randomly as well (perhaps from a small set of discrete values, so that the hyperplanes take only a few orientations).
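
Here is a rough sketch of what that randomized variant could look like (the helper names, the candidate count, and the use of Gini impurity are my own illustration, not an established implementation):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_oblique_split(X, y, n_candidates=10, coef_choices=(-1.0, 1.0), seed=None):
    """Draw a few random oblique candidates a*x_i + b*x_j >= T and keep the best,
    analogous to the randomized thresholds in Extra-Trees."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best, best_score = None, np.inf
    for _ in range(n_candidates):
        i, j = rng.choice(d, size=2, replace=False)    # random pair of features
        a, b = rng.choice(coef_choices, size=2)        # coefficients from a small discrete set
        proj = a * X[:, i] + b * X[:, j]
        t = rng.uniform(proj.min(), proj.max())        # random threshold within range
        mask = proj >= t
        if mask.all() or (~mask).all():                # skip degenerate splits
            continue
        # Weighted impurity of the two children; lower is better.
        score = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / n
        if score < best_score:
            best_score, best = score, (i, j, a, b, t)
    return best, best_score
```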

Are there some cases where you think these non-perpendicular decision trees would be an improvement over the regular implementation of various decision tree algorithms?

Best Answer

Oblique Random Forests

The conventional random forest algorithm selects one feature to split on at each node. This implies that if we want to achieve some kind of elaborate split (like defining a diagonal in two dimensions), many such splits will be required. On the other hand, if each node is something like a logistic regression, then we can directly accommodate linear combinations of features in a single node.
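
As a loose sketch of that idea (my own illustration, not code from any of the papers below), a node could fit a small logistic regression on the samples that reach it and split on the sign of its decision function, which is a single oblique hyperplane:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def oblique_node_split(X_node, y_node):
    """Fit a logistic regression at one node and route samples by the sign of
    its decision function w.x + b, i.e. a single linear-combination split."""
    clf = LogisticRegression(max_iter=1000).fit(X_node, y_node)
    go_right = clf.decision_function(X_node) >= 0.0
    return clf, go_right

# Tiny usage example with a diagonal class boundary x_1 + x_2 >= 1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] + X[:, 1] >= 1.0).astype(int)
clf, go_right = oblique_node_split(X, y)
print(clf.coef_, clf.intercept_)  # fitted hyperplane roughly along x_1 + x_2 = 1
```

A single such split can capture a diagonal boundary that would otherwise require a staircase of axis-aligned splits.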

Breiman mentioned this possibility in passing in his original random forest paper, but it has not drawn the same attention that "orthogonal" random forests have.

"On Oblique Random Forests" Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe, and Fred A. Hamprecht

Rotation Forests

Rotation forest applies PCA to random subsets of the features for each tree and then builds the tree in the rotated (orthogonal) PCA basis. The effect is somewhat like what you suggest here, because the projection onto the PCA basis is a linear combination of the inputs.

This isn't exactly the same as your suggestion, though, since the coefficients of the linear combinations produced by PCA are entirely determined by the data used to fit them, without regard to split quality. That stands in contrast to the method you propose, which estimates (or randomizes) a linear function of the data at each split.

Clearly, the cost of computing a PCA rotation for every tree could be quite high, especially when the number of observations is large.

But the upside of rotation forest is that it can ameliorate a weakness of random forests: axis-aligned splits struggle when the decision boundary is diagonal in the original feature space. To the extent that the PCA rotation re-aligns that diagonal with an axis, the model will be better able to capture that effect.
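
As a loose illustration of that last point (this is a simplified single global rotation, not the full per-tree, per-feature-subset procedure from the paper), rotating the inputs with PCA before fitting an ordinary random forest can turn a diagonal boundary into an axis-aligned one:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Correlated features whose leading principal direction lies along the diagonal,
# with a class boundary that is diagonal in the original coordinates.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=1000)
y = (X[:, 0] + X[:, 1] >= 0).astype(int)

plain_rf = RandomForestClassifier(n_estimators=100, random_state=0)
rotated_rf = make_pipeline(PCA(), RandomForestClassifier(n_estimators=100, random_state=0))

# In the PCA basis the boundary becomes a threshold on a single rotated feature,
# so the second model can represent it with axis-aligned splits.
print(cross_val_score(plain_rf, X, y, cv=5).mean())
print(cross_val_score(rotated_rf, X, y, cv=5).mean())
```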

Here's the paper: Rodríguez JJ, Kuncheva LI, Alonso CJ. "Rotation forest: A new classifier ensemble method." IEEE Trans Pattern Anal Mach Intell. 2006 Oct;28(10):1619-30.