Solved – Penalizing to prevent overfitting

random forest

I am currently working on a decision tree algorithm. As you might know, a decision tree becomes very specific as you add more inputs/nodes, which makes it a good classifier but also gives it a tendency to overfit.

To illustrate the point, say I have 3 features, each of which can take on 2 different values. If I'm not mistaken, this simple tree would have 8 leaves (terminal nodes). Furthermore, let's say (1) the training set contains 100 samples, (2) this is a binary classification problem, and (3) at one of the leaves 100% of the samples end up in category 1, but this particular leaf contains only 2 samples. I assume that one's confidence in the accuracy of this leaf dwindles once one takes into account that it holds only 2% of the population. I am therefore curious whether there is any penalizing heuristic for decision trees that takes the sparsity of the data into account.

To rephrase the question: is there some heuristic that calculates confidence based on diminished sample size? If, for example, a leaf is 100% accurate but contains only 2% of the population, while one level up is a node with 90% accuracy that contains 10% of the population, would the algorithm treat the higher-population node as the better one to base its decision on?
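To make the comparison concrete, here is a minimal sketch of the kind of calculation I have in mind. The Wilson score lower bound is just one candidate I picked for illustration; I don't know whether tree implementations actually use it.

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / (1 + z**2 / n)

# The 2-sample leaf (2/2 in category 1) vs. its 10-sample parent (9/10):
print(wilson_lower_bound(2, 2))   # ~0.34: perfect purity, but a tiny sample
print(wilson_lower_bound(9, 10))  # ~0.60: the bigger node wins on confidence
```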

Please let me know if my question needs revising or clarification of any kind. Thanks.

Best Answer

Penalizing as you would in L1 regression is difficult to reconcile with the simple algorithms used for greedy tree growth; you would need some sort of global optimization of the tree, which becomes unwieldy.
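One nuance: classical cost-complexity pruning does impose a penalty of exactly this flavor, but it is applied after the tree has been grown greedily rather than during growth. As I recall the CART formulation (notation varies between texts):

```latex
% CART cost-complexity (weakest-link) pruning: among subtrees T of the
% fully grown tree, minimize training error R(T) plus a penalty on the
% number of leaves |T|, with alpha >= 0 chosen e.g. by cross-validation.
R_\alpha(T) = R(T) + \alpha \, |T|
```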

Pruning is a common approach in single-decision-tree analysis, but it is rarely used in ensemble-of-trees methods like Random Forest (which this question is tagged with). There, overfitting is instead combated by increasing the randomization of the individual models: via bagging, by decreasing the number of features examined at each split, or by reducing the number of candidate split points examined per tree, as in Extremely Randomized Trees.
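As a concrete sketch of pruning a single tree, scikit-learn exposes cost-complexity pruning through its ccp_alpha parameter; the synthetic dataset below mirrors the question's setup (3 features, 100 samples) but is otherwise an arbitrary illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's setup: 100 samples, 3 features.
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees get pruned away; larger alpha means a smaller tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_test, y_test):.2f}")
```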

You can also limit the complexity of the tree directly with a max-depth or leaf-size parameter.
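For instance, assuming scikit-learn again, a minimum leaf size of 10 on a 100-sample training set would rule out the 2-sample leaf from the question by construction (the thresholds here are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,          # no path may be deeper than 3 splits
    min_samples_leaf=10,  # every leaf must hold at least 10 training samples
)
tree.fit(X, y)
print(tree.get_n_leaves())  # bounded above by 2**max_depth = 8
```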

In a Random Forest model I would do a parameter sweep over the number of in-bag cases, the number of features examined at each split, and the leaf size or maximum tree depth, and see if that achieves what you want.
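A sketch of that sweep, again assuming scikit-learn's RandomForestClassifier; the grid values are illustrative guesses, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

param_grid = {
    "max_samples": [0.5, 0.8, None],  # number of in-bag cases per tree
    "max_features": [1, 2, 3],        # features examined at each split
    "min_samples_leaf": [1, 5, 10],   # leaf size
    "max_depth": [2, 4, None],        # maximum tree depth
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```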