Solved – Is decision tree output a prediction or class probabilities

cart, classification, decision-theory, python, random-forest

A random forest works by aggregating the results of many decision trees. Recently, I was reading about how the RandomForest aggregates those results, and it made me question whether each decision tree outputs a single class prediction or a probability for each class.

The documentation for predict_proba says:

"The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf."

The part about "mean predicted class probabilities" suggested to me that the decision trees are non-deterministic. Furthermore, the lecture by Nando de Freitas here also talks about class probabilities at around the 30-minute mark.
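To make "mean predicted class probabilities" concrete, my understanding is that the forest just averages each tree's predict_proba over its estimators_ attribute; here is a small check of that assumption (the dataset and names are mine, not from the docs):

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=200, random_state=0)
>>> rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
>>> per_tree = [t.predict_proba(X[:5]) for t in rf.estimators_]  # one array per tree
>>> np.allclose(np.mean(per_tree, axis=0), rf.predict_proba(X[:5]))
True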

My question is: how is it possible to get class probabilities from a single decision tree?

As far as I know, the default for a RandomForestClassifier in sklearn is a deterministic decision tree (for something like ExtraTreesClassifier, that is not the case). So where do these "class probabilities" for a single tree come from?

Best Answer

Just build the tree so that the leaves contain not just a single class estimate but a probability estimate as well. This can be done simply by running any standard decision tree algorithm and, for each leaf, recording the fraction of training samples of each class that end up there; this is what sklearn does. Such trees are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.

If you don't set max_depth or similar parameters, then the tree will always keep splitting until every leaf is pure, and all the probabilities will be 1 (as Soren says).
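For instance (a minimal sketch with a made-up dataset; with continuous features an unrestricted tree keeps splitting until every leaf is pure):

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=200, random_state=0)
>>> clf = DecisionTreeClassifier().fit(X, y)  # no max_depth: grows until pure leaves
>>> np.unique(clf.predict_proba(X))           # every row of proba is one-hot
array([ 0.,  1.])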

Note that there is nothing nondeterministic about this tree: given an input, it deterministically produces both a class prediction and a confidence score in the form of a probability.
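One quick way to see that (a standalone sketch, independent of the verification below): predict is, as far as I can tell, just the argmax of predict_proba mapped through classes_, and repeated calls on the same input agree exactly.

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=200, random_state=1)
>>> clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
>>> proba = clf.predict_proba(X)
>>> np.array_equal(clf.predict(X), clf.classes_[proba.argmax(axis=1)])
True
>>> np.array_equal(proba, clf.predict_proba(X))  # same input, same output
True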

Verification that this is what's happening:

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

>>> X, y = make_classification(n_informative=10, n_samples=1000)
>>> Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

>>> clf = DecisionTreeClassifier(max_depth=5)  # cap depth so leaves stay impure
>>> clf.fit(Xtrain, ytrain)

>>> clf.predict_proba(Xtest[:5])
array([[ 0.19607843,  0.80392157],
       [ 0.9017094 ,  0.0982906 ],
       [ 0.9017094 ,  0.0982906 ],
       [ 0.02631579,  0.97368421],
       [ 0.9017094 ,  0.0982906 ]])

>>> from sklearn.utils import check_array
>>> from sklearn.tree.tree import DTYPE
>>> def get_node(X):  # index of the leaf each sample of X falls into
...     return clf.tree_.apply(check_array(X, dtype=DTYPE))
...
>>> node_idx, = get_node(Xtest[:1])
>>> ytrain[get_node(Xtrain) == node_idx].mean()  # fraction of class-1 training samples in that leaf
0.80392156862745101

(In the not-yet-released sklearn 0.17, this get_node helper can be replaced by clf.apply.)
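Once 0.17 is out, the end of the check should look something like this (a sketch assuming clf.apply keeps the same leaf-index semantics as tree_.apply):

>>> node_idx, = clf.apply(Xtest[:1])
>>> ytrain[clf.apply(Xtrain) == node_idx].mean()  # should reproduce the value above
0.80392156862745101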
