Solved – How do random forests and extremely randomized trees split differently?

Tags: classification, python, random-forest, scikit-learn

For a random forest, we split each node by minimizing Gini impurity or entropy over a set of candidate features. With RandomForestClassifier in sklearn, we can choose to split using the Gini or entropy criterion. However, from what I have read about the Extra-Trees classifier, a random value is selected for the split (so I assumed Gini and entropy play no role there). Yet ExtraTreesClassifier from sklearn also has the option to choose Gini or entropy for the split. I am a little confused here.
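As a quick sanity check (assuming scikit-learn is installed), both classifiers do expose the same `criterion` parameter; as the answer below explains, the difference lies in how candidate cutpoints are generated, not in how they are scored:

```python
# Both ensemble classifiers in scikit-learn accept criterion="gini" or
# criterion="entropy"; only the cutpoint-selection strategy differs.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

rf = RandomForestClassifier(criterion="entropy", random_state=0).fit(X, y)
et = ExtraTreesClassifier(criterion="entropy", random_state=0).fit(X, y)

print(rf.score(X, y), et.score(X, y))
```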

Best Answer

One iteration of Random Forest:

  1. Select $m$ features randomly as a candidate set of splitting features
  2. Within each of these features, find the "best" cutpoint, where "best" is defined by Gini / Entropy / whatever measure
  3. Now you have $m$ features paired with their optimal cutpoints. Choose as your splitting feature and cutpoint the pair that has the "best" performance with respect to Gini / Entropy / whatever measure
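The random-forest split search above can be sketched as follows. This is an illustrative toy implementation, not scikit-learn's actual code; `best_split` and `gini` are hypothetical helper names, and the exhaustive scan over midpoints between sorted feature values is the key point:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector: 1 - sum(p_k^2).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_idx):
    # Random-forest style: for each candidate feature, scan ALL midpoints
    # between consecutive sorted values and keep the (feature, cutpoint)
    # pair minimizing the weighted Gini impurity of the children.
    best = None  # (score, feature, cutpoint)
    for j in feature_idx:
        vals = np.unique(X[:, j])
        for cut in (vals[:-1] + vals[1:]) / 2:
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, cut)
    return best

# A perfectly separable 1-D toy example: the optimal cut is at 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
score, feat, cut = best_split(X, y, [0])
print(score, feat, cut)
```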

One iteration of Extremely Randomized Trees:

  1. Select $m$ features randomly as a candidate set of splitting features

  2. Within each of these features $F_i$, with $i \in \{1, \dots, m\}$, draw a single random cutpoint uniformly from the interval $(\min(F_i), \max(F_i))$. Evaluate the performance of this feature with this cutpoint with respect to Gini / Entropy / whatever measure

  3. Now you have $m$ features paired with their randomly selected cutpoints. Choose as your splitting feature and cutpoint the pair that has the "best" performance with respect to Gini / Entropy / whatever measure
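The Extra-Trees procedure can be sketched the same way. Again a toy illustration with hypothetical helper names, not scikit-learn's implementation: the only change from the random-forest sketch is that each candidate feature contributes exactly one uniformly drawn cutpoint, which is then still scored by the impurity measure:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector: 1 - sum(p_k^2).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def extra_trees_split(X, y, feature_idx, rng):
    # Extra-Trees style: draw ONE random cutpoint per candidate feature,
    # uniformly on (min, max) of that feature, then keep the pair with
    # the best (lowest) weighted Gini impurity.
    best = None  # (score, feature, cutpoint)
    for j in feature_idx:
        cut = rng.uniform(X[:, j].min(), X[:, j].max())
        left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, cut)
    return best

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
rng = np.random.default_rng(0)
score, feat, cut = extra_trees_split(X, y, [0], rng)
print(score, feat, cut)
```

Note that the cutpoint is random, so unlike the random-forest sketch the split is usually not optimal; averaging many such trees is what makes the ensemble work.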