Solved – the equivalent of the complexity parameter (rpart in R) in Python for regression trees (sklearn)

python, r, rpart

The complexity parameter decides when to stop splitting. What is its equivalent in Python? Since decreasing cp tends to increase prediction accuracy, is there a similar parameter in sklearn?

Best Answer

First, we need to understand what the complexity parameter (cp) is in the context of the rpart implementation of the CART algorithm (Classification and Regression Trees). It is a way of deciding whether a potential split is worth making while growing your tree: a split is only attempted if it decreases the tree's overall relative error by at least a factor of cp. There are other ways of evaluating split quality (information gain, Gini index, etc.), but rpart's stopping rule is driven by complexity.
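The stopping rule described above can be sketched as a few lines of Python. This is a hand-rolled illustration of the idea, not rpart's actual code; the function name and arguments are invented for this example:

```python
def cp_split_is_attempted(error_before, error_after, root_error, cp):
    """Mimic rpart's cp rule: attempt a split only if it improves the
    tree's relative error (error scaled by the root node's error) by
    at least cp."""
    improvement = (error_before - error_after) / root_error
    return improvement >= cp

# A split that removes 10% of the relative error passes a cp of 0.01,
# while a split that removes only 0.1% does not.
print(cp_split_is_attempted(0.5, 0.4, 1.0, 0.01))    # large improvement
print(cp_split_is_attempted(0.5, 0.499, 1.0, 0.01))  # negligible improvement
```

With cp = 0 every split is attempted, which is why lowering cp grows deeper trees.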

If you take a look at the sklearn documentation, complexity doesn't get mentioned among the possible values for the "criterion" parameter. Instead, you have the option of choosing "gini" for the Gini index and "entropy" for information gain (note that these apply to DecisionTreeClassifier; DecisionTreeRegressor uses error-based criteria such as squared error).

Long story short, there is no direct equivalent, as that specific criterion isn't mentioned.
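That said, sklearn does expose parameters that play a role similar to cp's stopping behavior, even though they are not split criteria. A minimal sketch (the threshold values 0.01 and 0.05 are arbitrary choices for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=0.5, random_state=0)

# min_impurity_decrease: a node is split only if the split decreases
# impurity by at least this amount -- behaviorally closest to cp's
# "stop splitting when the gain is too small" role.
tree_a = DecisionTreeRegressor(min_impurity_decrease=0.01,
                               random_state=0).fit(X, y)

# ccp_alpha (sklearn >= 0.22): minimal cost-complexity pruning, the
# same cost-complexity framework that rpart's cp comes from.
tree_b = DecisionTreeRegressor(ccp_alpha=0.05, random_state=0).fit(X, y)

print(tree_a.get_n_leaves(), tree_b.get_n_leaves())
```

Raising either parameter yields a smaller tree, much like raising cp in rpart.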

Also, a word of warning about your statement that "decreasing the cp tends to increase the accuracy in the prediction": be careful with that. By decreasing cp, you are essentially making splitting easier, regardless of whether or not a split makes sense in any generalizable way, so you are more likely overfitting your model. I would highly suggest doing out-of-sample validation on your models to find the sweet spot for your tuning parameters.
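In sklearn, the out-of-sample validation suggested above is easy to do with cross-validation. A minimal sketch, tuning `min_impurity_decrease` as the cp-like knob (the candidate grid is an arbitrary choice for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)

# 5-fold cross-validation over a few threshold values; the best value
# is the one with the lowest held-out mean squared error.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_impurity_decrease": [0.0, 0.001, 0.01, 0.1, 1.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

This mirrors what `rpart`'s built-in cross-validated error table (`printcp`/`plotcp`) gives you in R.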