RandomForest Trees Voting – How to Make the RandomForest Trees Vote Decimals Instead of Binary?

Tags: classification, machine learning, r, random forest

My question is about binary classification, say separating good customers from bad customers, but not regression or non-binary classification. In this context, a random forest is an ensemble of classification trees. For each observation, every tree votes a "yes" or "no", and the average vote of all trees is the final forest probability.

My question is about modifying the behavior of the underlying trees: how can we modify the randomForest function (from the randomForest package in R) so that each tree votes a decimal instead of a binary yes/no? To clarify what I mean by a decimal, let's think about how decision trees work.

A fully grown decision tree ends up with a single good or a single bad instance in each terminal node. Suppose I limit the terminal node size to 100. The terminal nodes would then look like:

Node1 = 80 bad, 20 good
Node2 = 51 bad, 49 good
Node3 = 10 bad, 90 good

Notice that even though Node1 and Node2 both vote "bad", their "strength of bad-ness" is very different. That is what I am after. Instead of having the trees produce 1 or 0 (the default behavior), can one modify the R package so that they vote 80/100, 51/100, 10/100, and so on?

Best Answer

This is a subtle point that varies from software to software. There are two main methods that I'm aware of:

  1. Binary leaves - Each leaf votes for its majority class. This is how randomForest works in R, even when using predict(..., type="prob"): the returned probability is the fraction of trees that voted for each class.
  2. Proportion leaves - Each leaf returns the proportion of its training samples belonging to each class, and the forest averages these proportions. This is how sklearn.ensemble.RandomForestClassifier.predict_proba works. In another answer, @usεr11852 points out that R's ranger package also provides this functionality. Happily, I can attest that, in my limited usage, ranger is also much, much faster than randomForest.
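
To make the difference concrete, here is a minimal sketch in plain Python (no library calls) that applies both voting rules to the three terminal nodes from the question. It assumes, purely for illustration, that one observation lands in Node1, Node2 and Node3 in three different trees. The function names and the tie-breaking rule are my own, not part of randomForest, ranger or scikit-learn.

```python
# Leaf class counts from the question: (bad, good) per terminal node.
# Assumption for illustration: one observation falls into these three
# leaves, one leaf per tree, in a three-tree forest.
leaves = {"Node1": (80, 20), "Node2": (51, 49), "Node3": (10, 90)}


def majority_vote(bad, good):
    """Binary leaf: vote 1.0 for "bad" if bad is the majority, else 0.0.

    Ties go to "good" here; that choice is arbitrary.
    """
    return 1.0 if bad > good else 0.0


def proportion_vote(bad, good):
    """Proportion leaf: vote the fraction of the leaf's samples that are bad."""
    return bad / (bad + good)


def forest_probability(votes):
    """The forest's P(bad) is the plain average of the per-tree votes."""
    return sum(votes) / len(votes)


binary_votes = [majority_vote(bad, good) for bad, good in leaves.values()]
proportion_votes = [proportion_vote(bad, good) for bad, good in leaves.values()]

print("binary votes:     ", binary_votes)
print("proportion votes: ", proportion_votes)
print("P(bad), binary leaves:     %.3f" % forest_probability(binary_votes))
print("P(bad), proportion leaves: %.3f" % forest_probability(proportion_votes))
```

With binary leaves, Node1 and Node2 contribute the same vote, so the forest reports about 0.667 for "bad". With proportion leaves, the near-tie in Node2 counts for much less, and the same three leaves give 0.47.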

I don't think there's an easy way to get randomForest to use the proportion-leaf method, since the R package is really just a wrapper around a C and Fortran program. Unless you enjoy modifying someone else's code, you'll have to either write your own or find another software implementation.