A Random Forest works by aggregating the results of many decision trees. Recently, I was reading about how a RandomForest aggregates those results, and it made me question whether each decision tree contributes a single class prediction or a probability for each class.
The documentation for `predict_proba` says:

"The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf."
The part about "mean predicted class probabilities" indicates to me that the decision trees are non-deterministic. Furthermore, the lecture by Nando de Freitas here also talks about class probabilities at around the 30-minute mark.
My question is: how is it possible to get class probabilities from a single decision tree?
As far as I know, the default for a RandomForestClassifier in sklearn is a deterministic decision tree (for something like ExtraTreesClassifier, that is not the case). So where do these "class probabilities" for a single tree come from?
Best Answer
Just build the tree so that the leaves contain not just a single class estimate, but a probability estimate as well. This can be done simply by running any standard decision tree algorithm, passing a bunch of data through it, and recording what fraction of the samples in each leaf belong to each class; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.
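To make this concrete, here is a minimal sketch (the dataset and parameters are illustrative) showing that a depth-limited sklearn tree's `predict_proba` output is exactly the class fractions in the leaf a sample falls into:

```python
# Sketch: a leaf's class fractions become the tree's probability estimate.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Limit depth so some leaves stay impure and probabilities are informative.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:1])
print(proba)

# The same numbers can be read directly off the fitted tree structure:
leaf = clf.apply(X[:1])[0]            # index of the leaf this sample lands in
values = clf.tree_.value[leaf][0]     # per-class totals stored at that node
print(values / values.sum())          # normalized, this matches predict_proba
```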
If you don't set `max_depth` or similar parameters, then the tree will always keep splitting until every leaf is pure, and all the probabilities will be 1 (as Soren says). Note that this tree is not non-deterministic; rather, given an input, it deterministically produces both a class prediction and a confidence score in the form of a probability.
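A quick sketch of that degenerate case: with no depth limit, an sklearn tree fits the training data exactly, so every probability it reports on those points is 0 or 1, and refitting on the same data gives identical predictions.

```python
# Sketch: an unrestricted tree has pure leaves on its training data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)  # no max_depth
probs = full.predict_proba(X)
print(np.unique(probs))  # only 0. and 1. — every training point hits a pure leaf

# Deterministic: refitting on the same input reproduces the same predictions.
again = DecisionTreeClassifier(random_state=0).fit(X, y)
print((full.predict(X) == again.predict(X)).all())
```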
Verification that this is what's happening:
(In the not-yet-released sklearn 0.17, this `get_node` helper can be replaced by `clf.apply`.)
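The original verification code is not reproduced above; a minimal sketch of the same check, written against the modern sklearn API rather than the original `get_node` helper, might look like this:

```python
# Sketch: the forest's predict_proba equals the mean of the per-tree
# class probabilities (each of which is a leaf class fraction).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=25, max_depth=3,
                            random_state=0).fit(X, y)

forest_probs = rf.predict_proba(X)
mean_tree_probs = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)

print(np.allclose(forest_probs, mean_tree_probs))  # True
```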