Solved – Regression Trees’ greedy algorithm in Hastie et al. (2009)

cart, machine learning, random forest

I am reading The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2009), more specifically the section on regression decision trees (p. 307 of the book). There is something I do not understand about their splitting algorithm. The authors explain the mechanism used to derive the splitting variable and the split point; they write:

"If we adopt as our criterion minimization of the sum of squares $\sum(y_i-f(x_i))^2$, it is easy to see that the best $\hat{c}_m$ is just the average $y_i$ in region $R_m$:

$$ \hat{c}_m=\text{ave}(y_i|x_i \in R_m) $$

Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable $j$ and split point $s$, and define the pair of half-planes

$$ R_1(j,s)=\{X|X_j\leq s\} \quad \text{and} \quad R_2(j,s)=\{X|X_j > s\} $$

Then we seek the splitting variable $j$ and split point $s$ that solve

$$ \min_{j,s} \left[\min_{c_1} \sum_{x_i \in R_1(j,s)}(y_i-c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)}(y_i-c_2)^2 \right] $$

For any choice $j$ and $s$, the inner minimization is solved by

$$ \hat{c}_1=\text{ave}(y_i|x_i \in R_1(j,s)) \quad \text{and} \quad \hat{c}_2=\text{ave}(y_i|x_i \in R_2(j,s)) $$

For each splitting variable, the determination of the split point $s$ can be done very quickly and hence, by scanning through all of the inputs, determination of the best pair $(j, s)$ is feasible."
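To make sure I follow the mechanics, here is my own rough Python sketch of that greedy search (not code from the book, and not how an efficient implementation would do it): for every feature $j$ and every candidate split point $s$, fit the two half-plane means and keep the pair $(j, s)$ with the smallest total sum of squares.

```python
import numpy as np

def best_greedy_split(X, y):
    """Brute-force scan over (j, s) for a single greedy split, as I understand
    the passage above. Real CART code sorts each feature once and updates the
    two sums of squares incrementally instead of recomputing them."""
    n, p = X.shape
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:      # candidate split points for feature j
            left = y[X[:, j] <= s]             # R1(j, s)
            right = y[X[:, j] > s]             # R2(j, s)
            # inner minimisation: c1_hat and c2_hat are just the region means
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse
```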

I am struggling to understand how the greedy algorithm they describe differs from finding the best binary partition. Is it simply that we limit the set of admissible partitions to those of the form "variable $j$ is above/below $s$", so that considering all possible partitions would also mean considering much more complex ones?

Best Answer

This simple example might illustrate why the 'greedy algorithm' is quicker than 'finding the best partition', but also why it can give a worse result.

Imagine a binary outcome with two features that follows this rule: the outcome is A if feature 1 is non-negative and feature 2 is negative, or if feature 1 is negative and feature 2 is non-negative; the outcome is B in every other case.

The optimal partition is very obvious:

- if (f1 >= 0) and (f2 < 0): the outcome is A
- else if (f1 < 0) and (f2 >= 0): the outcome is A
- else: the outcome is B

However, to arrive at this partition it is necessary to evaluate two rules jointly (one involving f1 and one involving f2); in other words, two splits would have to be chosen simultaneously. This becomes increasingly complex as the number of features grows (in this stylised example it would not be much of a problem). The greedy algorithm therefore only looks for the single best decision rule of the form feature > a that can be applied next.

Now if the algorithm can only choose f1 or f2 and a single split point a, which is what the greedy algorithm does, then which split gets picked first depends very much on the latent structure of the data (just think of how the observations could be spread over the four cells (f1 < 0, f2 < 0), (f1 < 0, f2 >= 0), (f1 >= 0, f2 < 0) and (f1 >= 0, f2 >= 0)). It will definitely be fast, because it 'only' has to scan two feature spaces and all possible single binary splits within them.
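To see this numerically, here is a small sketch on made-up XOR-style data (my own illustration, not something from the book): no single split of the form f <= s reduces the sum of squares noticeably, so the greedy criterion has essentially nothing to choose between, even though a two-level tree would separate A from B perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up XOR-style data: outcome A (coded 1) when f1 and f2 have opposite
# signs, outcome B (coded 0) otherwise.
f1 = rng.uniform(-1, 1, 1000)
f2 = rng.uniform(-1, 1, 1000)
y = ((f1 >= 0) != (f2 >= 0)).astype(float)

def best_single_split_sse(x, y):
    """Smallest total SSE achievable with one split of the form x <= s."""
    best = np.inf
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = min(best, sse)
    return best

print(((y - y.mean()) ** 2).sum())     # SSE with no split at all (about n/4)
print(best_single_split_sse(f1, y))    # barely any lower: a split on f1 alone does not help
print(best_single_split_sse(f2, y))    # same story for f2
```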

Finding the best partition instead requires evaluating all possible splits of the first variable jointly with all possible subsequent splits further down the tree, and so on. As you can imagine, this is far more complex than the greedy approach.
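To put rough numbers on that (a back-of-the-envelope count of my own): with $p$ features and at most $n$ distinct values per feature, one greedy split only has to scan on the order of $p\,n$ candidate pairs $(j, s)$, whereas jointly optimising all the splits of a tree multiplies these choices together:

$$ \underbrace{O(p\,n)}_{\text{greedy, one split at a time}} \qquad \text{vs.} \qquad \underbrace{O\big((p\,n)^{m}\big)}_{\text{joint search over all } m \text{ internal nodes}} $$

For a balanced tree of depth $d$ there are $m = 2^d - 1$ internal nodes, so the joint search blows up exponentially while the greedy search stays linear in $p\,n$ per split.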

This difference in complexity is what the authors refer to.
