Solved – Complexity Parameter in Decision Tree

Tags: cart, r, self-study

How is the complexity parameter (cp) calculated, and what does it mean?

From what I have read, cp acts as a threshold: the tree keeps splitting nodes until the reduction in relative error from a further split falls below that value.

Some sources I have read say that cp affects only the growth of the tree, while others say that it affects pruning too. To me it appears to affect only growth, but I am not sure.

I am using the rpart package to build the trees. For a classification tree, the misclassification rate is available to evaluate the predictions, but for a regression tree, is there anything besides the MSE to evaluate them?

Best Answer

This is answered in this rpart resource. From p. 25:

For regression models (see next section) the scaled cp has a very direct interpretation: if any split does not increase the overall $R^2$ of the model by at least cp (where $R^2$ is the usual linear-models definition) then that split is decreed to be, a priori, not worth pursuing. The program does not split said branch any further, and saves considerable computational effort.
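To make the quoted rule concrete, here is a minimal sketch in plain Python (not rpart itself) of the gain in $R^2$ from a single split, compared against cp. The data, the helper names `sse` and `r2_gain`, and the choice of split points are all made up for illustration; rpart's own search over split points is not reproduced here.

```python
def sse(values):
    """Sum of squared deviations from the mean (the regression 'risk' of a node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def r2_gain(root_sse, parent, left, right):
    """Increase in overall R^2 from splitting `parent` into `left` and `right`.

    The drop in SSE is scaled by the SSE of the root node, which is what
    makes the result comparable to the scaled cp.
    """
    return (sse(parent) - sse(left) - sse(right)) / root_sse


# Hypothetical response values at the root node.
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
root_sse = sse(y)

# First split: low values versus high values.
left, right = y[:3], y[3:]
first_gain = r2_gain(root_sse, y, left, right)

# Second split: try to divide the left child further.
second_gain = r2_gain(root_sse, left, left[:1], left[1:])

for cp in (0.01, 0.05):
    print(f"cp={cp}: first split kept={first_gain >= cp}, "
          f"second split kept={second_gain >= cp}")
```

With these made-up numbers the first split raises $R^2$ by about 0.97 and the second by about 0.012, so the second split clears a threshold of cp = 0.01 but not cp = 0.05.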

That same page gives this formula for how the cp parameter affects calculation of a tree's risk:

$$R_{cp}(T) \equiv R(T) + cp \cdot |T| \cdot R(T_1)$$

($T_1$ here is the tree with no splits, and $|T|$ is the number of splits in the tree. The full formal definition of risk is outside the scope of your question, but for reference it is on p. 4.)
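As a rough illustration of the formula above, the following Python sketch evaluates $R_{cp}(T)$ for a few candidate subtrees and picks the one with the lowest penalized risk. The risks and split counts are invented numbers, and `penalized_risk` and `best_subtree` are hypothetical helper names, not rpart functions.

```python
def penalized_risk(risk, n_splits, cp, root_risk):
    """R_cp(T) = R(T) + cp * |T| * R(T_1)."""
    return risk + cp * n_splits * root_risk


# Hypothetical nested subtrees as (number of splits |T|, risk R(T)).
# The first entry is T_1, the tree with no splits.
candidates = [(0, 100.0), (1, 60.0), (2, 50.0), (5, 44.0)]
root_risk = candidates[0][1]


def best_subtree(cp):
    """Return the candidate (n_splits, risk) that minimizes R_cp(T)."""
    return min(
        candidates,
        key=lambda t: penalized_risk(t[1], t[0], cp, root_risk),
    )


for cp in (0.01, 0.05, 0.2, 0.5):
    n_splits, risk = best_subtree(cp)
    print(f"cp={cp}: preferred tree has {n_splits} splits (R(T)={risk})")
```

For these numbers the preferred tree shrinks from 5 splits at cp = 0.01 to 2, 1, and finally 0 splits as cp rises to 0.05, 0.2, and 0.5: each extra split has to reduce $R(T)$ by more than $cp \cdot R(T_1)$ to be worth its penalty.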