Solved – Using a histogram to estimate class label densities in a tree learner

histogram, machine learning

In a sequential (on-line) tree algorithm, I'm trying to estimate class label densities using a histogram. The algorithm grows a tree by generating test functions at each decision node, which correspond to the features in the data set. That is to say, each feature is used to create a test of the form $g(x) > \theta$ that decides the left/right propagation of samples at each node of the binary tree: if the value of $g(x)$ exceeds the threshold $\theta$, the sample is propagated down the right branch; otherwise it goes down the left branch.
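For concreteness, here's a minimal sketch of such a node test in Python (the names are illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecisionNode:
    g: Callable[[Sequence[float]], float]  # test function generated for this node
    theta: float                           # threshold

def route(node: DecisionNode, x: Sequence[float]) -> str:
    """Propagate a sample: right branch if g(x) > theta, left otherwise."""
    return "right" if node.g(x) > node.theta else "left"

# Example: a test that looks at feature 2 with threshold 0.5.
node = DecisionNode(g=lambda x: x[2], theta=0.5)
print(route(node, [0.1, 0.9, 0.7]))  # -> right
```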

When training the tree, I have to choose the best test to use at each decision node, which is accomplished using a quality measurement in the form

$$Q(R_j, s) = -\sum_{i=1}^{K} p_i^{j} \log p_i^{j} \;+\; \frac{|R_{jls}|}{|R_j|} \sum_{i=1}^{K} p_i^{jls} \log p_i^{jls} \;+\; \frac{|R_{jrs}|}{|R_j|} \sum_{i=1}^{K} p_i^{jrs} \log p_i^{jrs}$$

where $R_{jls}$ and $R_{jrs}$ are the left and right partitions made by test $s$, i.e. the samples whose test value is above or below the threshold $\theta$; $p_i^j$ is the label density of class $i$ in node $j$ (and $p_i^{jls}$, $p_i^{jrs}$ are the label densities of class $i$ in the left and right child nodes); $K$ is the total number of classes; and $|\cdot|$ denotes the number of samples in a partition.

In other words, the gain with respect to a test $s = g(x)$ depends on the entropy of class labels within that feature. This requires knowing the class label density at each decision node.
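For concreteness, a minimal sketch of this computation, assuming the entropy-based gain written above and taking per-class sample counts for the two partitions (function names are illustrative):

```python
import math

def entropy(p):
    """Shannon entropy of a class-label density (list of probabilities)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def gain(counts_left, counts_right):
    """Entropy gain Q(R_j, s) for a split s, from per-class sample counts
    in the left and right partitions (parent counts are their sum)."""
    n_left, n_right = sum(counts_left), sum(counts_right)
    n = n_left + n_right
    counts = [l + r for l, r in zip(counts_left, counts_right)]
    h_parent = entropy([c / n for c in counts])
    h_left = entropy([c / n_left for c in counts_left]) if n_left else 0.0
    h_right = entropy([c / n_right for c in counts_right]) if n_right else 0.0
    return h_parent - (n_left / n) * h_left - (n_right / n) * h_right

# Example: 3 classes; the split separates class 0 from classes 1 and 2 fairly well.
print(gain([8, 1, 1], [0, 5, 5]))  # positive gain
```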

So, my uncertainty lies in how this can be accomplished with a histogram. A histogram for each feature would have $N$ bins of size $\mathrm{range}/N$, where range is the difference between the maximum and minimum values that the feature takes. If we don't have a priori knowledge of this range, we can keep track of the max/min values as more training data arrive. Such a histogram tracks the number of samples falling within each bin, but it tells us nothing about the class labels.
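A minimal sketch of that first option, with the min/max tracked on-line (names are illustrative; note that widening the range does not re-bin samples already counted, an approximation an on-line learner has to accept):

```python
class FeatureHistogram:
    """Fixed-size histogram over one feature, with the value range
    tracked on-line as new training samples arrive."""
    def __init__(self, n_bins):
        self.n_bins = n_bins
        self.lo = float("inf")
        self.hi = float("-inf")
        self.counts = [0] * n_bins

    def bin_index(self, value):
        """Map a feature value to a bin of width (hi - lo) / n_bins."""
        if self.hi == self.lo:
            return 0
        width = (self.hi - self.lo) / self.n_bins
        return min(int((value - self.lo) / width), self.n_bins - 1)

    def add(self, value):
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)
        self.counts[self.bin_index(value)] += 1
```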

Another option is to keep a histogram for each class, for each feature, at each node. You would then be tracking the number of samples with a particular class label whose feature value falls within a given range (bin).
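A minimal sketch of that bookkeeping (illustrative names; bin indices are assumed to come from per-feature histograms like the one above):

```python
class NodeStatistics:
    """Per decision node: one histogram per (feature, class) pair, stored as
    counts[feature][bin][class] = number of samples of that class whose
    value for that feature fell into that bin."""
    def __init__(self, n_features, n_bins, n_classes):
        self.counts = [[[0] * n_classes for _ in range(n_bins)]
                       for _ in range(n_features)]

    def update(self, bins, label):
        """bins[f] is the bin index of the sample's value for feature f."""
        for f, b in enumerate(bins):
            self.counts[f][b][label] += 1

    def split_counts(self, feature, threshold_bin):
        """Per-class counts to the left/right of a candidate threshold
        placed at a bin boundary."""
        n_classes = len(self.counts[feature][0])
        left, right = [0] * n_classes, [0] * n_classes
        for b, per_class in enumerate(self.counts[feature]):
            side = left if b < threshold_bin else right
            for i, c in enumerate(per_class):
                side[i] += c
        return left, right
```

With this layout, candidate thresholds fall on bin boundaries, and the two count vectors plug straight into the gain computation sketched earlier.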

This second option seems more in line with the above equation, since we need the label densities for each class at each node, but I just want to verify that I'm on the right track. For reference, see Section 2.1.2 of the paper below.

Saffari et al., 2009. On-line Random Forests. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops). DOI: 10.1109/ICCVW.2009.5457447

Best Answer

I don't fully understand what you want to do with the histograms.

If you have a set of records $R_j$ and you are able to generate a split $s$, you can compute $Q(R_j,s)$ easily. Given the split, you directly have the two subsets of records $R_{jls}$ and $R_{jrs}$, and you only need to compute the probability distribution across classes over these three sets. More specifically, you only need: $$p_i^j \ \ i=1,\dots,K$$ $$p_i^{jls} \ \ i=1,\dots,K$$ $$p_i^{jrs} \ \ i=1,\dots,K$$ You can compute these values by counting the number of records of each class in a particular set of records.

$$p_i^j = \frac{|R^i_j|}{|R_j|} $$ $$p_i^{jls} = \frac{|R^i_{jls}|}{|R_{jls}|} $$ $$p_i^{jrs} = \frac{|R^i_{jrs}|}{|R_{jrs}|} $$ where, for example, $R_j^i$ denotes the records of class $i$ in the set $R_j$.
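As a sketch (with illustrative names), this counting is direct:

```python
from collections import Counter

def class_densities(labels, n_classes):
    """p_i = |R^i| / |R|: the fraction of records of each class in a set,
    given the class labels of the records in that set."""
    counts = Counter(labels)
    return [counts[i] / len(labels) for i in range(n_classes)]

# Example: R_j contains 4 records of class 0, 1 of class 1, 1 of class 2.
print(class_densities([0, 0, 0, 0, 1, 2], n_classes=3))  # ~[0.667, 0.167, 0.167]
```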

You only need to keep track of the records in each node (each node is associated with a set of records) and be able to find a split.

If you have all the records in a particular node, you can also find splits, for example with an exhaustive search across all possible splits for each feature. If you have a record $(\mathbf{x},y)=(x_1,\dots,x_n,y)$, you put it into one of the two subsets according to a test on feature $x_i$, simply by checking whether $$x_i < \vartheta$$ (if $x_i$ is a continuous attribute).

Once you have a split you can compute $Q$; then, by varying $\vartheta$ and the feature, you can find the best split according to the $Q$ measure.
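A minimal sketch of that exhaustive search (illustrative names; it assumes a `gain` function over per-class left/right counts, such as the entropy gain sketched earlier):

```python
def best_split(records, n_classes, gain):
    """Try every feature i and every observed value as the threshold theta;
    keep the (i, theta) pair that maximizes the Q (gain) measure.
    `records` is a list of (x, y) pairs, where x is a feature vector and
    y an integer class label."""
    best_i, best_theta, best_q = None, None, float("-inf")
    n_features = len(records[0][0])
    for i in range(n_features):
        for theta in sorted({x[i] for x, _ in records}):
            left, right = [0] * n_classes, [0] * n_classes
            for x, y in records:
                (left if x[i] < theta else right)[y] += 1
            if sum(left) and sum(right):  # skip degenerate splits
                q = gain(left, right)
                if q > best_q:
                    best_i, best_theta, best_q = i, theta, q
    return best_i, best_theta, best_q
```

In an on-line setting, where you can't store all the records, the per-node histograms from the question would supply these left/right counts instead of a stored `records` list.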
