Solved – How is splitting done on numerical predictors in randomForest package in R

r, random forest

I understand that for the individual trees, a least squares measure is used to measure node impurity for each candidate split of the data at a node, and that the best split is selected.

What I don't understand yet (since I couldn't find an answer in the documentation) is how candidate splits are found in the first place, i.e., given numerical predictors (not nominal or ordinal), how are the split points found for those numerical predictors in the randomForest package?

Aside: I am also wondering whether ordinal predictors and dependent variables are supported in randomForest now?

Best Answer

It works the same way as with ordinal variables -- the algorithm sweeps from the minimal to the maximal value present in the attribute, treats each value as a candidate threshold, and selects the best one. This sweep can be elegantly sped up to linear complexity using presorting.

Because of that, randomForest simply converts ordered factors to numerical values when they are predictors, and to categorical data when they are the decision (the response).
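The sweep described above can be sketched in base R. This is only an illustration of the idea, not the code randomForest actually runs. The helper name `best_numeric_split` is hypothetical, and thresholds are assumed to be observed values, with the rule "go left if x <= threshold". The predictor is sorted once, and running sums give the least-squares impurity of every candidate in a single pass.

```r
# Sketch: best least-squares split on one numeric predictor.
# best_numeric_split is a hypothetical helper, not part of randomForest.
best_numeric_split <- function(x, y) {
  ord <- order(x)                 # presort once
  xs <- x[ord]
  ys <- y[ord]
  n <- length(ys)

  # Running sums give the left/right sums for every candidate cheaply.
  cum_y <- cumsum(ys)
  cum_y2 <- cumsum(ys^2)
  tot_y <- cum_y[n]
  tot_y2 <- cum_y2[n]

  # Candidate k sends the k smallest values left (x <= xs[k]).
  # Only positions where the sorted value changes are usable thresholds.
  k <- which(diff(xs) != 0)
  if (length(k) == 0) return(NULL)  # constant predictor: nothing to split

  # Sum of squared errors around the node mean: sum(y^2) - sum(y)^2 / n
  sse_left <- cum_y2[k] - cum_y[k]^2 / k
  sse_right <- (tot_y2 - cum_y2[k]) - (tot_y - cum_y[k])^2 / (n - k)
  sse <- sse_left + sse_right

  best <- which.min(sse)
  list(threshold = xs[k[best]], sse = sse[best])
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)
res <- best_numeric_split(x, y)
res$threshold  # 3: the low and high groups are separated
```

After the initial sort, each candidate costs constant time, which is where the linear sweep comes from.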

Related Solutions

Solved – How to choose the split in Random forest for categorical predictors (features)

The usual vanilla implementation tries all possible combinations of your categories. It encodes each combination as an integer whose bits record which categories are selected and which are left out at the split, read from left to right. For example, if you have a variable with the classes "Cat", "Dog", "Cow", "Rat", it would sweep through the possible splits, meaning something like:

Dog vs the rest = 0100 (remember, read from left to right)

Cat vs the rest = 1000

Those are single categories by themselves, but it also tries combinations:

Dog and Cat vs Cow and Rat = 1100

Cow and Cat vs Dog and Rat = 1010

And then, as mentioned, it uses an integer to represent each split:

> library(R.utils)
> intToBin(12)
[1] "1100"
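To make the encoding concrete, here is a minimal base R sketch that lists every bit pattern for the four example classes. The helper names `mask_to_bits` and `enumerate_splits` are hypothetical and are not part of randomForest or R.utils. The leftmost category is assumed to be the most significant bit, matching the "read from left to right" convention above.

```r
# Sketch: enumerate category subsets as integers (base R only).
# mask_to_bits and enumerate_splits are hypothetical helpers for illustration.
mask_to_bits <- function(mask, k) {
  # 0/1 vector of length k; the leftmost category is the most significant bit.
  (mask %/% 2^((k - 1):0)) %% 2
}

enumerate_splits <- function(categories) {
  k <- length(categories)
  # Skip 0 (nothing selected) and 2^k - 1 (everything selected).
  masks <- seq_len(2^k - 2)
  data.frame(
    mask = masks,
    bits = vapply(masks, function(m) {
      paste(mask_to_bits(m, k), collapse = "")
    }, character(1)),
    selected = vapply(masks, function(m) {
      paste(categories[mask_to_bits(m, k) == 1], collapse = "+")
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

splits <- enumerate_splits(c("Cat", "Dog", "Cow", "Rat"))
splits[splits$mask == 12, ]  # bits "1100", selected "Cat+Dog"
```

This listing contains each two-way partition twice, once per mirror image (for example 1100 and 0011), so k categories give 2^(k-1) - 1 distinct splits.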

Random Forest – Understanding Gini Decrease and Gini Impurity of Children Nodes

You simply did not use the target class variable at all. Gini impurity, like all other impurity functions, measures the impurity of the outputs after a split. What you have done is measure something using only the sample sizes.

Let me try to derive the formula for your case.

Suppose for simplicity that you have a binary classifier. Denote by $A$ the test attribute and by $C$ the class attribute, which takes the values $c_+, c_-$.

The initial gini index before the split is given by $$I(A) = 1 - P(A_+)^2 - P(A_-)^2$$ where $P(A_+)$ is the proportion of data points which have the value $c_+$ for the class variable, and $P(A_-)$ is the same for $c_-$.

Now, the impurities for the left and right nodes would be $$I(Al) = 1 - P(Al_+)^2-P(Al_-)^2$$ $$I(Ar) = 1 - P(Ar_+)^2-P(Ar_-)^2$$ where $P(Al_+)$ is the proportion of data points from the left subset of $A$ which have the value $c_+$ in the class variable, etc.

Now the final formula for GiniGain would be

$$GiniGain(A) = I(A) - p_{left}I(Al) - p_{right}I(Ar)$$ where $p_{left}$ is the proportion of instances in the left subset, $\frac{|Al|}{|Al|+|Ar|}$ (the number of instances in the left subset divided by the total number of instances in $A$), and $p_{right}$ is defined the same way for the right subset.
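The formula can be checked on a toy example with a small base R sketch. The function names `gini` and `gini_gain`, and the logical vector `go_left` that marks which instances go to the left node, are made up for this illustration.

```r
# Sketch: Gini impurity and Gini gain for one candidate split.
# gini and gini_gain are illustrative names, not library functions.
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)
}

gini_gain <- function(labels, go_left) {
  n <- length(labels)
  left <- labels[go_left]
  right <- labels[!go_left]
  gini(labels) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

labels <- c("+", "+", "+", "-", "-", "-")

# Split 1: four instances left (3 "+", 1 "-"), two right (both "-").
gain_mixed <- gini_gain(labels, c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE))

# Split 2: three left (all "+"), three right (all "-").
gain_pure <- gini_gain(labels, c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
```

Here `gain_mixed` is 0.25 and `gain_pure` is 0.5. The same subset sizes with a different arrangement of class labels would give a different gain, which is why sample sizes alone cannot determine it.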

I feel my notation could be improved; I will revisit it later when I have more time.

Conclusion

Using only the number of data points is not enough: impurity measures how well one feature (the test feature) is able to reproduce the distribution of another feature (the class feature). The test feature's distribution produces the numbers you used (how many go left, how many go right), but the distribution of the class feature is not used in your formulas.

Later edit - proof of why it decreases

I just noticed that I missed the part that proves why the gini index on the child nodes is always less than on the parent node. I do not have a complete or verified proof, but I think it is a valid one. For other interesting things related to the topic, you might check "Technical Note: Some Properties of Splitting Criteria" by Leo Breiman. My proof follows.

Suppose that we are in the binary case, and that all the values in a node can be completely described by a pair $(a,b)$, meaning $a$ instances of the first class and $b$ instances of the second class. We can then state that in the parent node we have $(a,b)$ instances.

In order to find the best split, we sort the instances according to a test feature and try all the possible binary splits. Sorting by a given feature is actually a permutation of the instances, in which the sequence of classes starts with an instance of either the first class or the second class. Without loss of generality, we will suppose that it starts with an instance of the first class (if this is not the case, we have a mirror proof with the same calculation).
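As a rough illustration of this sweep, the base R sketch below sorts by the test feature and computes the size-weighted gini index of the two children for every split position. The helper `sweep_gini` is hypothetical. It assumes distinct feature values and codes the first class as TRUE and the second as FALSE.

```r
# Sketch: weighted gini of every binary split along one sorted feature.
# sweep_gini is a hypothetical helper; feature values are assumed distinct.
sweep_gini <- function(x, is_first) {
  ord <- order(x)
  cls <- is_first[ord]
  n <- length(cls)
  k <- seq_len(n - 1)            # k instances go left, n - k go right
  a_left <- cumsum(cls)[k]       # first-class count on the left
  b_left <- k - a_left           # second-class count on the left
  a_right <- sum(cls) - a_left
  b_right <- (n - k) - a_right
  g <- function(a, b) 1 - (a / (a + b))^2 - (b / (a + b))^2
  (k / n) * g(a_left, b_left) + ((n - k) / n) * g(a_right, b_right)
}

x <- c(2.5, 0.1, 1.7, 3.2, 0.9)
is_first <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
weighted <- sweep_gini(x, is_first)
parent <- 1 - (3 / 5)^2 - (2 / 5)^2   # gini of the parent node (3, 2)
which.min(weighted)                   # position of the best split
```

In this toy data the sorted classes are first, first, first, second, second, so the best position is 3, where both children are pure.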

The first split to try puts $(1,0)$ instances in the left node and $(a-1,b)$ instances in the right node. How do the gini index values of these candidate left and right child nodes compare with the parent node's? Obviously, on the left we have $h(left) = 1 - (1/1)^2 - (0/1)^2 = 0$, so the left side has a smaller gini index value. How about the right node?

$$h(parent) = 1 - (\frac{a}{a+b})^2 - (\frac{b}{a+b})^2$$ $$h(right) = 1 - (\frac{a-1}{(a-1)+b})^2 - (\frac{b}{(a-1)+b})^2$$

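Note that $a \ge 1$ (otherwise how could we separate an instance of the first class into the left node?). After simplification, $h(parent) = \frac{2ab}{(a+b)^2}$ and $h(right) = \frac{2(a-1)b}{(a+b-1)^2}$. Taken alone, the right node's value is not always smaller than the parent's (for $a=5, b=2$ it is $16/36 \approx 0.444$ against $20/49 \approx 0.408$), but what enters the aggregated index is the value weighted by the node's share of instances, and that one is: $$\frac{a+b-1}{a+b}\,h(right) = \frac{2(a-1)b}{(a+b)(a+b-1)} \le \frac{2ab}{(a+b)^2} = h(parent)$$ because $\frac{a-1}{a+b-1} \le \frac{a}{a+b}$. Since the left node contributes $0$, the aggregated gini index of this trivial split is no larger than the parent's value.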
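A quick numerical check of this trivial first split, as a base R sketch over a small grid of $(a,b)$ pairs. The function name `gini2` and the grid bounds are arbitrary choices for this illustration. It looks at both the raw right-node value and its size-weighted value, since the weighted one is what enters the aggregated index.

```r
# Sketch: compare the parent's gini with the right child's after moving one
# first-class instance to the left node. gini2 is an illustrative helper.
gini2 <- function(a, b) {
  n <- a + b
  1 - (a / n)^2 - (b / n)^2
}

grid <- expand.grid(a = 1:30, b = 1:30)
n <- grid$a + grid$b

parent <- gini2(grid$a, grid$b)
right <- gini2(grid$a - 1, grid$b)     # right child holds (a - 1, b)
weighted_right <- (n - 1) / n * right  # left child is pure and contributes 0

raw_can_exceed <- any(right > parent)  # e.g. a = 5, b = 2
weighted_never_exceeds <- all(weighted_right <= parent + 1e-12)
```

On this grid `raw_can_exceed` and `weighted_never_exceeds` are both TRUE: the unweighted right-node value can exceed the parent's when the first class is the majority, while the weighted value never does.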

Now the final stage of the proof is to note that, while considering all the possible split points dictated by the data we have, we keep the one which has the smallest aggregated gini index. This means that the optimum we choose is less than or equal to the trivial one, which I showed is no larger than the parent's. This concludes that in the end the gini index will decrease.

As a final conclusion, we have to note that even if some candidate splits produce child nodes with values bigger than the parent node's, the split that we choose has the smallest aggregated value among them, which is also no larger than the parent gini index value.

Hope it helps.
