Regression forests are essentially random forests used for regression. They use the same framework as decision forests for classification, with a few components swapped out.
Two of these components are:
- How is the information gain calculated when deciding how to split at a node?
- How do we make a prediction when we reach a leaf?
An interesting (though 150-page) report on this topic is the following:
http://research.microsoft.com/apps/pubs/default.aspx?id=155552
This report addresses the problems mentioned above, but I do not really understand the proposed solutions.
1) Page 52, Appendix A. They define the continuous formulation of information gain to be:
$ I_j = \sum_{v \in S_j} \log(|\Lambda_y(v)|) - \sum_{i \in \{L,R\}} \left( \sum_{v \in S_j^i} \log(|\Lambda_y(v)|) \right) $. As I don't really know where else to ask: does anyone know what $|\cdot|$ denotes in this case? How would you implement a function to compute $I_j$?
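My current guess is that $|\cdot|$ is the matrix determinant, with $\Lambda_y(v)$ the covariance matrix of the targets in the node containing $v$, so each inner sum collapses to (node size) × log-determinant. A minimal sketch under that assumption (the small ridge term `eps` is my own addition for degenerate nodes, not from the report):

```python
import numpy as np

def log_det_cov(y, eps=1e-6):
    # log|Lambda_y| for the targets y in one node, assuming |.| is
    # the determinant of the (possibly 1x1) covariance matrix.
    cov = np.atleast_2d(np.cov(y, rowvar=False))
    cov = cov + eps * np.eye(cov.shape[0])   # ridge for tiny/constant nodes
    _, logdet = np.linalg.slogdet(cov)
    return logdet

def info_gain(y_parent, y_left, y_right):
    # I_j = sum_{v in S_j} log|Lambda_y(v)| - sum over children.
    # Lambda_y(v) is the same for every v in a node, so each sum
    # becomes n_node * log|Lambda_y(node)|.
    n, nl, nr = len(y_parent), len(y_left), len(y_right)
    return (n * log_det_cov(y_parent)
            - nl * log_det_cov(y_left)
            - nr * log_det_cov(y_right))
```

With this reading, a split that separates two well-clustered groups of targets shrinks the children's covariances and therefore yields a large positive gain.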
2) When we reach a leaf, we have a number of data points reaching this leaf, say of the form (x, y). From my understanding we would now fit a function f(x) to these data points (e.g. by least squares) and then use f(x) to predict a y for a future test point x. In the report (page 50) they talk about a probabilistic linear fitting that returns a conditional distribution p(y|x). How could that be done? I don't understand how we can compute a probability density function here.
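One possible reading (an assumption on my part, not necessarily what the report means): fit the mean by least squares and attach the residual variance, so the leaf stores a Gaussian $p(y|x) = \mathcal{N}(w^\top x + b,\ \sigma^2)$ instead of a point estimate. A sketch, where the ridge term `alpha` is something I added for numerical stability:

```python
import numpy as np

def fit_leaf(X, y, alpha=1e-3):
    # Least-squares line for the mean, residual variance for the
    # spread; together they define a Gaussian p(y|x) at this leaf.
    X1 = np.column_stack([X, np.ones(len(X))])          # bias column
    A = X1.T @ X1 + alpha * np.eye(X1.shape[1])         # ridge for stability
    w = np.linalg.solve(A, X1.T @ y)
    resid = y - X1 @ w
    sigma2 = resid @ resid / max(len(y) - X1.shape[1], 1)
    return w, sigma2

def leaf_predict(w, sigma2, x):
    # Mean and variance of the predictive Gaussian N(mean, sigma2) at x.
    x1 = np.append(np.atleast_1d(x), 1.0)
    return x1 @ w, sigma2
```

Under this reading, the forest's overall prediction at x would be a mixture of the per-tree leaf Gaussians, which is where the density over y comes from.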
I hope someone is familiar with the above-mentioned report and might be able to help 🙂
Thank you!
Best Answer
You can check out my personal blog for more information about regression trees: Decision Trees, Random Forests and XGBoost.
Hope that answers your question.