Solved – How do random forests and extremely randomized trees split differently?

Tags: classification, python, random-forest, scikit-learn

For a random forest, we split each node by minimizing Gini impurity or entropy over a set of candidate features. With RandomForestClassifier in sklearn, we can choose to split using the Gini or entropy criterion. However, from what I have read about the Extra-Trees classifier, a random value is selected for the split (so I assumed Gini and entropy play no role there). Yet ExtraTreesClassifier from sklearn also has the option to choose Gini or entropy for the split. I am a little confused here.
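As a quick sanity check (assuming scikit-learn is installed), both classifiers do expose the same `criterion` parameter; as the answer below explains, the difference lies in how candidate cutpoints are generated, not in how they are scored:

```python
# Both ensemble classifiers in scikit-learn accept criterion="gini" or
# criterion="entropy"; only the cutpoint-selection strategy differs.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

rf = RandomForestClassifier(criterion="entropy", random_state=0).fit(X, y)
et = ExtraTreesClassifier(criterion="entropy", random_state=0).fit(X, y)

print(rf.score(X, y), et.score(X, y))
```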

Best Answer

One iteration of Random Forest:

  1. Select $m$ features randomly as a candidate set of splitting features
  2. Within each of these features, find the "best" cutpoint, where "best" is defined by Gini / Entropy / whatever measure
  3. Now you have $m$ features paired with their optimal cutpoints. Choose as your splitting feature and cutpoint the pair that has the "best" performance with respect to Gini / Entropy / whatever measure
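The random-forest split search above can be sketched as follows. This is an illustrative toy implementation, not scikit-learn's actual code; `best_split` and `gini` are hypothetical helper names, and the exhaustive scan over midpoints between sorted feature values is the key point:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector: 1 - sum(p_k^2).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_idx):
    # Random-forest style: for each candidate feature, scan ALL midpoints
    # between consecutive sorted values and keep the (feature, cutpoint)
    # pair minimizing the weighted Gini impurity of the children.
    best = None  # (score, feature, cutpoint)
    for j in feature_idx:
        vals = np.unique(X[:, j])
        for cut in (vals[:-1] + vals[1:]) / 2:
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, cut)
    return best

# A perfectly separable 1-D toy example: the optimal cut is at 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
score, feat, cut = best_split(X, y, [0])
print(score, feat, cut)
```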

One iteration of Extremely Randomized Trees:

  1. Select $m$ features randomly as a candidate set of splitting features

  2. Within each of these features $F_i$, with $i \in \{1, \dots, m\}$, draw a single random cutpoint uniformly from the interval $(\min(F_i), \max(F_i))$. Evaluate the performance of this feature with this cutpoint with respect to Gini / Entropy / whatever measure

  3. Now you have $m$ features paired with their randomly selected cutpoints. Choose as your splitting feature and cutpoint the pair that has the "best" performance with respect to Gini / Entropy / whatever measure
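The Extra-Trees procedure can be sketched the same way. Again a toy illustration with hypothetical helper names, not scikit-learn's implementation: the only change from the random-forest sketch is that each candidate feature contributes exactly one uniformly drawn cutpoint, which is then still scored by the impurity measure:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector: 1 - sum(p_k^2).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def extra_trees_split(X, y, feature_idx, rng):
    # Extra-Trees style: draw ONE random cutpoint per candidate feature,
    # uniformly on (min, max) of that feature, then keep the pair with
    # the best (lowest) weighted Gini impurity.
    best = None  # (score, feature, cutpoint)
    for j in feature_idx:
        cut = rng.uniform(X[:, j].min(), X[:, j].max())
        left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, cut)
    return best

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
rng = np.random.default_rng(0)
score, feat, cut = extra_trees_split(X, y, [0], rng)
print(score, feat, cut)
```

Note that the cutpoint is random, so unlike the random-forest sketch the split is usually not optimal; averaging many such trees is what makes the ensemble work.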