Regression – Can Predicted Values in Decision Tree Regression Fall Outside the Training Data Range?

cart, predictive-models, random-forest, regression

When it comes to decision trees, can the predicted value lie outside the range of the training data?

For example, if the range of the target variable in the training data is 0-100, can the model, when applied to new data, predict values such as -5 or 150?

My understanding of decision tree regression is that it is still a rule-based, left/right progression, and since at the bottom of the tree it can only return values it saw in the training set, it can never predict anything outside that range. Is that correct?

Best Answer

You are completely right: classical decision trees cannot predict values outside the range observed in the training data. A regression tree predicts the mean (or sometimes the median) of the training targets in each leaf, so every prediction is bounded by the minimum and maximum target values seen in training. The tree will not extrapolate.

The same applies to random forests: their predictions are averages of individual tree predictions, so they are bounded by the same range.
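To see this concretely, here is a minimal sketch using scikit-learn (the data and parameters are made up for illustration): a tree and a forest are trained on targets spanning 0-100, then queried far outside the training inputs, and their predictions stay clamped within that interval.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Toy data: a linear trend, so genuine extrapolation "should" exceed 100
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 10 * X_train.ravel()  # targets span 0..100

tree = DecisionTreeRegressor().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Inputs far outside the training range of X
X_new = np.array([[-5.0], [20.0]])
print(tree.predict(X_new))    # stays within [0, 100]
print(forest.predict(X_new))  # likewise: an average of leaf means
```

Both models return roughly 0 and 100 at the extremes, never -50 or 200, because every leaf value is an average of observed training targets.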

Theoretically, you sometimes see discussions of somewhat more elaborate architectures (botanies?), where the leaves of the tree don't contain a single value but a simple regression, e.g., regressing the dependent variable (DV) on a particular numerical independent variable (IV). Navigating the tree then gives you a rule set that determines which numerical IV to regress the DV on in each case. Such a "bottom level" regression can extrapolate and yield values not yet observed.

However, I don't think standard machine learning libraries offer this somewhat more complex structure (I recently looked for it through the CRAN Task Views for R), although there should not really be anything complex about it. You might be able to implement your own tree containing regressions in the leaves, as in the sketch below.
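As a rough sketch of that idea (not any existing library's API; the class name, median-based split, and toy data are my own assumptions for illustration), here is a one-split "model tree" with an ordinary least-squares regression in each leaf. Because the leaf models are linear, it happily extrapolates beyond the training target range:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

class LeafRegressionStump:
    """Toy 'model tree': a single split on one feature, with a separate
    linear regression fitted in each leaf. The linear leaf models can
    extrapolate, so predictions may fall outside the training target range."""

    def __init__(self, split_feature=0):
        self.split_feature = split_feature

    def fit(self, X, y):
        # Naive split point: the median of the chosen feature.
        # (Assumes both sides of the split are non-empty.)
        self.threshold_ = np.median(X[:, self.split_feature])
        left = X[:, self.split_feature] <= self.threshold_
        self.left_model_ = LinearRegression().fit(X[left], y[left])
        self.right_model_ = LinearRegression().fit(X[~left], y[~left])
        return self

    def predict(self, X):
        left = X[:, self.split_feature] <= self.threshold_
        y_hat = np.empty(len(X))
        if left.any():
            y_hat[left] = self.left_model_.predict(X[left])
        if (~left).any():
            y_hat[~left] = self.right_model_.predict(X[~left])
        return y_hat

# Same toy data as above: targets span 0..100
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 10 * X_train.ravel()

stump = LeafRegressionStump().fit(X_train, y_train)
print(stump.predict(np.array([[-5.0], [20.0]])))  # roughly [-50, 200]: it extrapolates
```

A real implementation would choose the split feature and threshold by minimizing the leaf-regression error and recurse on the resulting partitions, in the spirit of M5 model trees.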
