Solved – Regression tree algorithm with linear regression models in each leaf

cartrregressionrpart

Short version: I'm looking for an R package that can build decision trees whereas each leaf in the decision tree is a full Linear Regression model. AFAIK, the library rpart creates decision trees where the dependent variable is constant in each leaf. Is there another library (or a rpart setting I'm not aware of) that can build such trees?

Long version: I'm looking for an algorithm that builds a decision tree based on a training data set. Each decision in the tree splits the training data set into two parts, according to a condition on one of the independent variables. The root of the tree contains the full data set, and each item in the data set is contained in exactly one leaf node.

The algorithm goes like this:

  1. Begin with the full dataset, which is the root node of the tree. Pick this node and call it $N$.
  2. Create a Linear Regression model on the data in $N$.
  3. If $R^2$ of $N$'s linear model is higher than some threshold $\theta_{R^2}$, then we're done with $N$, so mark $N$ as a leaf and jump to step 5.
  4. Try $n$ random decisions, and pick the one that yields the best $R^2$ in the subnodes:
    • Pick a random independent variable $v_i$, as well as a random threshold $\theta_i$.
    • The decision $v_i \leq \theta_i$ splits the data set of $N$ into two new nodes, $\hat{N}$ and $\tilde{N}$.
    • Create Linear Regression models on both $\hat{N}$ and $\tilde{N}$, and calculate their $R^2$ (call them $\hat{r}$ and $\tilde{r}$).
    • From all those $n$ tuples $(v_i, \theta_i, \hat{r}, \tilde{r})$, select the one with the maximal $min(\hat{r}, \tilde{r})$. This yields a new decision in the tree, and $N$ has two new subnodes $\hat{N}$ and $\tilde{N}$.
  5. We have finished processing $N$. Pick a new node $N$ which has not yet been processed and go back to step 2. If all nodes have been processed, the algorithm ends.

This will recursively build a decision tree which splits the data into smaller parts, and calculates a Linear Model on each of those parts.

Step 3 is the exit condition, which prevents the algorithm from overfitting. Of course, there are other possible exit conditions:

  • Exit if $N$'s depth in the tree is above $\theta_{depth}$
  • Exit if the data set in $N$ is smaller than $\theta_{data set}$

Is there such an algorithm in an R package?

Best Answer

While they work differently than your algorithm, I believe you'll find mob() and FTtree interesting. For Zeileis' mob see http://cran.r-project.org/web/packages/party/vignettes/MOB.pdf For FTtree,Gama's functional trees an implementation is available in Weka and thus RWeka. See http://cran.r-project.org/web/packages/RWeka/index.html for details.