My question is about binary classification, say separating good customers from bad customers, not regression or multi-class classification. In this context, a random forest is an ensemble of classification trees. For each observation, every tree votes "yes" or "no", and the fraction of trees voting "yes" is the forest's final probability.
My question is about modifying the behavior of the underlying trees: how can we modify the randomForest function (of the randomForest package of R) so that each tree votes a decimal instead of a binary yes/no? To better understand what I mean by decimal, let's think about how decision trees work.
A fully grown decision tree has pure terminal nodes: each leaf contains only good or only bad instances. Assume instead that I limit the minimum terminal node size to 100. Then terminal nodes are going to look like:
Node1 = 80 bad, 20 good
Node2 = 51 bad, 49 good
Node3 = 10 bad, 90 good
Notice that even though Node1 and Node2 both vote "bad", their "strength of bad-ness" is very different. That is what I am after. Instead of having them produce 1 or 0 (which is the default behavior), can one modify the R package so they vote 80/100, 51/100, 10/100, etc.?
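To make the desired averaging concrete, here is a toy sketch (in Python, with the three node counts above treated as the leaves that one observation lands in across three hypothetical trees):

```python
# (bad, good) counts of the terminal node each of three trees assigns
# to the same observation -- the Node1/Node2/Node3 counts from above.
leaf_counts = [(80, 20), (51, 49), (10, 90)]

# Default behavior: each tree casts a hard 0/1 vote for its majority class.
hard_votes = [1 if bad > good else 0 for bad, good in leaf_counts]
p_bad_votes = sum(hard_votes) / len(hard_votes)   # 2/3 ~= 0.667

# Desired behavior: each tree contributes its leaf's "bad" proportion.
soft_votes = [bad / (bad + good) for bad, good in leaf_counts]
p_bad_soft = sum(soft_votes) / len(soft_votes)    # (0.80 + 0.51 + 0.10) / 3 = 0.47

print(p_bad_votes, p_bad_soft)
```

The two estimates differ substantially (0.667 vs. 0.47) because vote-averaging throws away how close Node2 was to a tie.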
Best Answer
This is a subtle point that varies from software to software. There are two main methods that I'm aware of:

1. **Vote-based:** each tree casts a hard vote for the majority class of its terminal node, and the forest reports the fraction of trees voting for each class. This is how `randomForest` works in R, even when using `predict(..., type="prob")`.
2. **Proportion-based:** each tree reports the class proportions in its terminal node, and the forest averages those proportions. This is how `sklearn.ensemble.RandomForestClassifier.predict_proba` works. In another answer, @usεr11852 points out that R's `ranger` package also provides this functionality. Happily, I can attest that, from my limited usage, `ranger` is also much, much faster than `randomForest`.

I don't think that there's an easy way to get `randomForest` to use the proportional leaf method, since the R package is really just a hook into C and FORTRAN code. Unless you enjoy modifying someone else's code, you'll either have to write your own implementation or find another software package.