Solved – Regression forest: Leaf node and information gain

Tags: cart, entropy, information theory, random forest, regression

Regression forests are essentially random forests used for regression. They reuse the same framework that decision forests use for classification, with a few components exchanged.

Two of these parts are:

  1. How do we calculate the information gain when deciding how to split at a node?
  2. How do we make a prediction when we reach a leaf?

An interesting report on this topic (though 150 pages long) is the following:
http://research.microsoft.com/apps/pubs/default.aspx?id=155552

This report touches on the problems mentioned above, but I do not really understand the proposed solutions.

1) Page 52, Appendix A. They define the continuous formulation of the information gain to be:
$ I_j = \sum_{v \in S_j} \log(|\Lambda_y(v)|) - \sum_{i \in \{L,R\}} \sum_{v \in S_j^i} \log(|\Lambda_y(v)|) $. As I don't really know where else to ask: does anyone know what $|\cdot|$ denotes in this case? How would you implement a function to compute $I_j$?

2) When we reach a leaf, we have a number of data points that reached it, say of the form (x, y). As I understand it, we would now fit a function f(x) to these data points (e.g. by least squares) and then use f(x) to predict a y for a future test point x. In the report (page 50), however, they talk about a probabilistic linear fit that returns a conditional distribution p(y|x). How could that be done? I don't understand how we can compute a probability density function here.

I hope someone is familiar with the report mentioned above and might be able to help 🙂
Thank you!

Best Answer

  1. Information gain is just a metric used to choose the attribute on which to split the data at a node. One important point to note is that while information gain is the metric used in classification, standard deviation is used as the metric for regression forests. Where classification chooses the attribute with the highest information gain, regression chooses the attribute that produces the largest decrease in standard deviation (the weighted standard deviation of the child nodes compared to that of the parent).
  2. After the regression tree is built, each leaf node in the tree may contain multiple samples from the training set. The average of the target values of the samples in a leaf is the output for any test sample whose decision path ends there.
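The two points above can be sketched in a few lines of plain Python. This is an illustrative toy (function names and the toy data are my own, not from the report): a single 1-D split chosen by standard-deviation reduction, and a leaf that predicts the mean of its targets.

```python
import statistics

def sd(ys):
    # Population standard deviation of the targets (0.0 for a single sample).
    return statistics.pstdev(ys)

def best_split(xs, ys):
    """Return (threshold, sd_reduction) for the best split on a 1-D feature."""
    base = sd(ys)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # skip splits that leave a child empty
        # Sample-weighted standard deviation of the two children.
        child = (len(left) * sd(left) + len(right) * sd(right)) / len(ys)
        gain = base - child  # decrease in standard deviation
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

def leaf_prediction(ys):
    # A leaf simply returns the mean of the training targets it holds.
    return sum(ys) / len(ys)

# Toy data: two clearly separated clusters of targets.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
t, gain = best_split(xs, ys)
left_ys = [y for x, y in zip(xs, ys) if x <= t]
print(t, round(leaf_prediction(left_ys), 2))  # split at x = 3.0; left leaf predicts 1.0
```

A full tree would apply `best_split` recursively to each child until a stopping criterion (e.g. minimum samples per leaf) is met; a regression forest would train many such trees on bootstrap samples and average their predictions.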

You can check out my personal blog for more information about regression trees: Decision Trees, Random Forests and XGBoost

Hope that answers your question.