Solved – Decision Trees: are splits always binary?

machine learning

Almost every lecture on decision trees opens with an example tree in which a node may have more than two children. However, as I understand it, a decision tree is always a binary tree, right? In my implementation I one-hot encode all categorical variables, so a split on such a feature always yields two children (the encoded value is 0 or 1).

So if I have a humidity feature with the values normal and high, after one-hot encoding I get two features, humidity.normal and humidity.high, and the tree will split first on one of them and then on the other (in either order).
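For concreteness, a minimal sketch of this setup (toy data, assuming pandas and scikit-learn):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data; humidity is the categorical feature in question.
df = pd.DataFrame({
    "humidity": ["normal", "high", "high", "normal", "high"],
    "play":     [1, 0, 0, 1, 1],
})

# One-hot encoding turns humidity into humidity.normal / humidity.high,
# so any split on these columns is a 0-vs-1 (two-child) split.
X = pd.get_dummies(df[["humidity"]], prefix_sep=".")
y = df["play"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(list(X.columns))  # ['humidity.high', 'humidity.normal']
```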

Are RandomForestClassifier/DecisionTreeClassifier in Python, and the corresponding implementations in R, constructed as binary trees (binary splits)?


Best Answer

CHAID trees can split a node into more than two children, so decision trees are not always binary.

There are many different tree-building algorithms, and the Random Forest algorithm actually creates an ensemble of decision trees. In the original paper, Breiman built the ensemble from a slight variation of the CART algorithm. However, it is not necessary to use CART to build a Random Forest model; other researchers have extended the concept by using base learners built with different algorithms.
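As an illustration (a sketch assuming scikit-learn, not how the library implements RandomForestClassifier internally), a Random-Forest-style ensemble can be assembled by bagging CART-style trees yourself:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A Random-Forest-like model: bootstrap samples plus CART trees that
# consider a random subset of features at each split.
rf_like = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),  # the base tree learner
    n_estimators=100,
    bootstrap=True,
    random_state=0,
).fit(X, y)

print(rf_like.score(X, y))
```

Swapping in a different base estimator gives the kind of extended ensemble mentioned above.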

It is also not necessary for the independent variables used to build a decision tree to be binary valued (0-1 valued or one-hot encoded categorical variables); requiring that would severely restrict the usability of these algorithms. A binary split can instead group multiple categories into each child node. For example, a weather variable with the categories windy, foggy, rainy, and sunny may be split into two nodes: Node 1 (windy + sunny) and Node 2 (foggy + rainy).
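To make the grouping idea concrete, here is a small sketch (toy data and a hand-rolled helper, not a library API) that scores every two-way partition of the weather categories by weighted Gini impurity, the criterion CART minimizes:

```python
from itertools import combinations

# Hypothetical toy data: (weather, label) pairs.
data = [
    ("windy", 1), ("windy", 1), ("sunny", 1), ("sunny", 1),
    ("foggy", 0), ("foggy", 0), ("rainy", 0), ("rainy", 0),
]
categories = ["windy", "foggy", "rainy", "sunny"]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

best = None
# Try sending every nonempty proper subset of categories to the left node.
for r in range(1, len(categories)):
    for left in combinations(categories, r):
        left_y = [y for c, y in data if c in left]
        right_y = [y for c, y in data if c not in left]
        score = (len(left_y) * gini(left_y)
                 + len(right_y) * gini(right_y)) / len(data)
        if best is None or score < best[0]:
            best = (score, set(left))

print(best)  # (0.0, {'windy', 'sunny'}) -> Node 1; foggy + rainy -> Node 2
```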

To answer your specific query: both scikit-learn and R's randomForest build their trees with CART-style binary splits, so their base learners are binary trees.
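You can verify the binary structure directly in scikit-learn: every fitted tree exposes children_left and children_right arrays, and each node is either a leaf or has exactly two children:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

for est in forest.estimators_:
    tree = est.tree_
    for node in range(tree.node_count):
        left, right = tree.children_left[node], tree.children_right[node]
        # A node is either a leaf (both children are -1) or an internal
        # node with exactly two children -- never a multiway split.
        assert (left == -1) == (right == -1)

print("every node is a leaf or a binary split")
```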