In fact there are two types of factors -- ordered (like Tiny < Small < Medium < Big < Huge) and unordered (Cucumber, Carrot, Fennel, Aubergine).
The first class can be handled the same way as continuous variables -- it is even easier, since there are only finitely many pivots to check, and there is no problem with extending the list of levels.
For the second class, you have to choose a subset of levels that will be directed into one branch, leaving the rest to the other -- and when a fitted tree later meets a level unseen during training, you can either (see the sketch after this list):
- throw an error,
- assume that the unseen class goes into your favorite branch,
- treat this as NA and select a branch in a more or less random way.
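To make those options concrete, here is a minimal sketch in Python -- all names are hypothetical, not from any library -- of how a fitted split could dispatch on a level it never saw in training:

```python
import random

class CategoricalSplit:
    """A fitted binary split on an unordered factor (illustrative only)."""

    def __init__(self, left_levels, known_levels, policy="error"):
        self.left_levels = set(left_levels)    # levels sent to the left branch
        self.known_levels = set(known_levels)  # every level seen in training
        self.policy = policy                   # "error" | "favorite" | "na"

    def branch(self, level):
        if level in self.known_levels:
            return "left" if level in self.left_levels else "right"
        # The level is unseen -- apply one of the three options above:
        if self.policy == "error":     # option 1: refuse to predict
            raise ValueError(f"unseen factor level: {level!r}")
        if self.policy == "favorite":  # option 2: a fixed "favorite" branch
            return "left"
        # option 3: treat as NA and choose more or less at random
        return random.choice(["left", "right"])
```

The hybrid recommended below -- option 1. for undeclared levels, option 2. for declared ones -- just means consulting a user-supplied level list before falling back on one fixed `policy`.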
Now, direct treatment of unordered factors is tricky, so many algorithms "cheat" and treat unordered factors as ordered ones, so they don't even touch this problem*. The rest usually use an integer mask, i.e. they optimize an integer from $1$ to $2^{\text{#categories}-1}-1$ and treat the $i$-th bit as the branch selection for factor level $i$. (Ever wondered why there is so often a limit of 32 levels?) In this setting, it is quite natural that unseen levels go silently into the "0" branch. Yet this does not seem quite "right" -- why really should we do this? And what about the leverage that a large number of levels gives an attribute in entropy-based attribute selection?
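Here is a sketch of that integer-mask search (names are illustrative, and Gini impurity stands in for whatever criterion the algorithm actually uses); bit $i$ of the mask sends level $i$ into the left branch, and any level whose bit is 0 -- including codes beyond the trained range -- falls into the "0" (right) branch:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_mask_split(level_codes, labels, n_levels):
    """Scan every mask from 1 to 2^(n_levels - 1) - 1, inclusive."""
    best_mask, best_score = None, float("inf")
    n = len(labels)
    for mask in range(1, 2 ** (n_levels - 1)):
        left  = [y for lv, y in zip(level_codes, labels) if (mask >> lv) & 1]
        right = [y for lv, y in zip(level_codes, labels) if not (mask >> lv) & 1]
        if not left or not right:
            continue  # degenerate partition, nothing to score
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score
```

Capping the mask below $2^{\text{#categories}-1}$ pins the last level to the right branch, which is exactly what removes each partition's mirror image and yields the $2^{\text{#categories}-1}-1$ count above.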
I would say the most sensible idea is to make the user define the full set of factor levels (R, for instance, does this organically, preserving levels through subset operations) and use option 1. for undeclared levels and option 2. for declared ones. Option 3. may make sense if you already have some NA-handling infrastructure.
*) There is also a side strategy of doing some non-trivial re-coding of levels into numbers, for instance Breiman's encoding -- yet this generates even more problems.
Oblique Random Forests
The conventional random forest algorithm selects a single feature to split on at each node. This implies that if we want to achieve an elaborate split (like a diagonal boundary in two dimensions), many such splits will be required. On the other hand, if each node is something like a logistic regression, then a single node can directly accommodate a linear combination of features.
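As a concrete sketch of that idea -- using scikit-learn for the node model, with a made-up function name -- a single "oblique" node could be fitted like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def oblique_split(X, y):
    """Fit a logistic regression at this node; split on its linear score."""
    clf = LogisticRegression().fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    return X @ w + b > 0  # boolean mask: True -> one child, False -> other
```

One such node carves out a diagonal boundary directly, where axis-aligned splits would have to approximate it with a staircase of many nodes.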
Breiman mentioned this possibility in passing in his original random forest paper, but it has not drawn the same attention that "orthogonal" random forests have.
"On Oblique Random Forests"
Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe,
and Fred A. Hamprecht
Rotation Forests
Rotation forest applies PCA to random subsets of the features for each tree it grows, and the tree then makes its splits in the orthogonal PCA basis. The effect is kind of like what you suggest here, because the projection onto the PCA basis is a linear combination of the inputs.
This isn't exactly the same as your suggestion, though, since the parameter estimates in PCA's linear combinations are fixed entirely by the inputs, without reference to the labels. This stands in contrast to the method you propose, which estimates a linear function of the data against the response.
Clearly, the cost of doing PCA for every tree could be quite high, especially when the number of observations is large.
But the upside to rotation forest is that it can ameliorate a weakness of random forest: random forest struggles when the decision boundary is diagonal in the original feature space. To the extent that rotation by PCA re-aligns that diagonal with a coordinate axis, the model will be better able to capture that effect.
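A rough per-tree sketch of that rotation step, assuming scikit-learn is available (the real algorithm also bootstraps cases and subsamples classes per feature group, and accounts for PCA's centering, all omitted here for brevity):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotated_tree(X, y, n_subsets=3, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))       # block-diagonal rotation
    for idx in subsets:
        pca = PCA().fit(X[:, idx])               # PCA on this feature subset
        R[np.ix_(idx, idx)] = pca.components_.T  # its loadings form one block
    tree = DecisionTreeClassifier().fit(X @ R, y)
    return tree, R  # predict on new data with tree.predict(X_new @ R)
```

Because the tree only ever sees `X @ R`, its axis-aligned splits are splits on linear combinations of the original inputs.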
Here's the paper:
Rodríguez JJ, Kuncheva LI, Alonso CJ. "Rotation forest: A new classifier ensemble method." IEEE Trans Pattern Anal Mach Intell. 2006 Oct;28(10):1619-30.
Best Answer
Decision tree algorithms do not evaluate all possible trees when they fit a tree; if they did, they would be solving an NP-hard problem. Instead, fitting algorithms typically make greedy decisions: at each stage they solve the sub-problem of finding an optimal split for the data in the given node, and then keep moving forward. And as you move deeper into the tree, a smaller subset of the data has made it to the given node, so each splitting rule is optimized over a smaller subset of the data.

All of these choices are linear scans of the data in the given node. This is not complicated to do, but it can become somewhat expensive computationally if you have a large number of observations or a large number of covariates to split on. However, a lot of the work can be split up and sent off to different machines to work on, so there are ways to build out your computational architecture to scale up. In general, though, the method works fairly quickly on most of the datasets you see in coursework, and in many real-world scenarios as well.
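For intuition, here is what one of those linear scans can look like for a single numeric covariate (a sketch with made-up names; Gini impurity stands in for the criterion):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Greedy scan: try the midpoint between each pair of adjacent values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    n = len(xs)
    best_pivot, best_score = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no boundary between equal values
        score = (i * gini(ys[:i]) + (n - i) * gini(ys[i:])) / n
        if score < best_score:
            best_pivot, best_score = (xs[i - 1] + xs[i]) / 2, score
    return best_pivot, best_score
```

Real implementations keep running class counts as they sweep, so each feature's scan stays linear after the initial sort instead of re-scoring both sides from scratch as this sketch does.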