Residual plot for regression tree: what should it look like?

cart, data visualization, residuals

  • I realize that decision trees are nonparametric methods

  • What should residuals vs. actual/fitted look like for a well-behaved regression tree?

  • My argument would be that, since each observation in a terminal node is assigned (as its predicted value) the average of the dependent variable at that node, you would expect the conditional distribution of the residuals (that is, within each node) to be approximately normal.

  • I have attached two plots for my decision tree (it validates at 63% on the test set, so it is fairly weak): residuals vs. fitted and residuals vs. actual.

  • Basically, my question: wouldn't the plots for a strong regression tree look like a step function of sorts?

Residuals vs. actual

Residuals vs. predicted

Best Answer

The predictions will look like a step function, but the plots you include will not.

The residual vs. actual plot looks OK to me. I have seen plots like that one even in ordinary regression, where diagonal patterns pop up when many observations share the same $X$s. Take a group of observations that have the same prediction and index them with $i$. If the $X$s are the same, then $\hat{y}_{i} - y_{i} = r_{i}$ with $\hat{y}_{i}=p$ for every $i$ in the group, so $r_{i} = p - y_{i}$: on the plane with $(y,r)$ axes the group falls on a straight diagonal line with slope $-1$. In a regression tree there are many groups where the prediction is identical (one per terminal node), so the pattern should come up.
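To make the diagonal-line argument concrete, here is a minimal sketch in plain Python. The leaf predictions, the noise level, and the sample sizes are made-up values for illustration, not taken from the question's data. It keeps the sign convention used above ($r = \hat{y} - y$) and checks that every point of a group satisfies $r + y = p$.

```python
import random

random.seed(0)

# Hypothetical leaf predictions: one constant value p per terminal node.
leaf_predictions = [10.0, 20.0, 35.0]

points = []  # tuples of (leaf prediction p, actual y, residual r)
for p in leaf_predictions:
    for _ in range(50):
        y = random.gauss(p, 3.0)  # actual values scatter around the leaf value
        r = p - y                 # residual = prediction - actual
        points.append((p, y, r))

def on_diagonal(p, y, r, tol=1e-9):
    # Within one leaf, r + y equals the constant p, i.e. r = p - y:
    # a line with slope -1 and intercept p in the (y, r) plane.
    return abs((r + y) - p) < tol

all_on_lines = all(on_diagonal(p, y, r) for p, y, r in points)
print(all_on_lines)
```

Plotting `y` against `r` for these points would give three parallel diagonal streaks, one per leaf, which is the pattern described for the residuals vs. actual plot.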

The second plot does look strange. Is that plot for the training set or the test set? If it's the training set, is every point visible? On the training set I would expect the residuals to be centered at 0, assuming that you built the tree to minimize the unexplained variance and that each observation has the same weight.
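The claim about training residuals can be checked with a small sketch. This is not the asker's model: it is a hand-rolled one-split tree (a "stump") on synthetic data, written in plain Python, with the helper names `fit_stump` and `predict` invented for illustration. It assumes squared-error splitting and equal weights, so each leaf predicts the mean of its training observations.

```python
import random

random.seed(1)

# Synthetic training data (an assumption for illustration only).
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [(2.0 if x < 4 else 7.0) + random.gauss(0, 1) for x in xs]

def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Pick the threshold that minimizes the total within-leaf squared error."""
    best = None
    # Skipping the smallest x keeps both leaves non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, mean(left), mean(right))
    return best[1:]

threshold, left_mean, right_mean = fit_stump(xs, ys)

def predict(x):
    # Piecewise-constant prediction: a step function of x.
    return left_mean if x < threshold else right_mean

# Same sign convention as above: residual = prediction - actual.
residuals = [predict(x) - y for x, y in zip(xs, ys)]
left_res = [r for x, r in zip(xs, residuals) if x < threshold]
right_res = [r for x, r in zip(xs, residuals) if x >= threshold]
print(mean(left_res), mean(right_res))
```

Both printed means are zero up to floating-point error, so in a residuals vs. predicted plot for the training set each vertical strip of points should straddle 0. On a test set this no longer has to hold, which is why it matters which set the second plot shows.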