Solved – Is decision tree output a prediction or class probabilities

cart, classification, decision-theory, python, random-forest

A random forest works by aggregating the results of many decision trees. Recently, I was reading about how the RandomForest aggregates those results, and it made me question whether each decision tree outputs a single class prediction or a probability for each class.

The documentation for predict_proba says:

"The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf."

The part about "mean predicted class probabilities" suggested to me that the decision trees are non-deterministic. Furthermore, the lecture by Nando de Freitas here also talks about class probabilities at around the 30-minute mark.
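To make "mean predicted class probabilities" concrete, my understanding is that the forest just averages each tree's predict_proba over its estimators_ attribute; here is a small check of that assumption (the dataset and names are mine, not from the docs):

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=200, random_state=0)
>>> rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
>>> per_tree = [t.predict_proba(X[:5]) for t in rf.estimators_]  # one array per tree
>>> np.allclose(np.mean(per_tree, axis=0), rf.predict_proba(X[:5]))
True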

My question is: how is it possible to get class probabilities from a single decision tree?

As far as I know, the default for a RandomForestClassifier in sklearn is a deterministic decision tree (for something like ExtraTreesClassifier, that is not the case). So where do these "class probabilities" for a single tree come from?

Best Answer

Just build the tree so that the leaves contain not just a single class estimate but a probability estimate as well. This can be done simply by running any standard decision tree algorithm and, for each leaf, recording the fraction of training samples of each class that end up there; this is what sklearn does. Such trees are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.

If you don't set max_depth or similar parameters, then the tree will always keep splitting until every leaf is pure, and all the probabilities will be 1 (as Soren says).
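For instance (a minimal sketch with a made-up dataset; with continuous features an unrestricted tree keeps splitting until every leaf is pure):

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=200, random_state=0)
>>> clf = DecisionTreeClassifier().fit(X, y)  # no max_depth: grows until pure leaves
>>> np.unique(clf.predict_proba(X))           # every row of proba is one-hot
array([ 0.,  1.])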

Note that there is nothing nondeterministic about this tree: given an input, it deterministically produces both a class prediction and a confidence score in the form of a probability.
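One quick way to see that (a standalone sketch, independent of the verification below): predict is, as far as I can tell, just the argmax of predict_proba mapped through classes_, and repeated calls on the same input agree exactly.

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=200, random_state=1)
>>> clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
>>> proba = clf.predict_proba(X)
>>> np.array_equal(clf.predict(X), clf.classes_[proba.argmax(axis=1)])
True
>>> np.array_equal(proba, clf.predict_proba(X))  # same input, same output
True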

Verification that this is what's happening:

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

>>> X, y = make_classification(n_informative=10, n_samples=1000)
>>> Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

>>> clf = DecisionTreeClassifier(max_depth=5)  # cap depth so leaves stay impure
>>> clf.fit(Xtrain, ytrain)

>>> clf.predict_proba(Xtest[:5])
array([[ 0.19607843,  0.80392157],
       [ 0.9017094 ,  0.0982906 ],
       [ 0.9017094 ,  0.0982906 ],
       [ 0.02631579,  0.97368421],
       [ 0.9017094 ,  0.0982906 ]])

>>> from sklearn.utils import check_array
>>> from sklearn.tree.tree import DTYPE
>>> def get_node(X):  # index of the leaf each sample of X falls into
...     return clf.tree_.apply(check_array(X, dtype=DTYPE))
...
>>> node_idx, = get_node(Xtest[:1])
>>> ytrain[get_node(Xtrain) == node_idx].mean()  # fraction of class-1 training samples in that leaf
0.80392156862745101

(In the not-yet-released sklearn 0.17, this get_node helper can be replaced by clf.apply.)
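Once 0.17 is out, the end of the check should look something like this (a sketch assuming clf.apply keeps the same leaf-index semantics as tree_.apply):

>>> node_idx, = clf.apply(Xtest[:1])
>>> ytrain[clf.apply(Xtrain) == node_idx].mean()  # should reproduce the value above
0.80392156862745101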
