Solved – Penalizing to prevent overfitting

random forest

I am currently working on a decision tree algorithm. As you might know, a decision tree becomes very specific as you add more inputs/nodes, which makes it a good classifier but also gives it a tendency to overfit.

To illustrate the point, say I have 3 features, each of which can take on 2 different values. If I'm not mistaken, this simple tree would have 8 leaves (terminal nodes). Furthermore, let's say (1) the training set contains 100 samples, (2) this is a binary classification problem, and (3) at one of the leaves 100% of the samples end up in category 1, but this particular leaf contains only 2 samples. I assume that one's confidence in the accuracy of this leaf dwindles once one takes into account that it holds only 2% of the population. I am therefore curious whether there is any penalizing heuristic for decision trees that takes the sparsity of the data into account.

To rephrase the question: is there some heuristic that calculates confidence based on diminished sample size? If, for example, a leaf is 100% accurate but contains only 2% of the population, while one level up is a node with 90% accuracy that contains 10% of the population, would the algorithm treat the higher-population node as the better one to base its decision on?
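To make the comparison concrete, here is a minimal sketch of the kind of calculation I have in mind. The Wilson score lower bound is just one candidate I picked for illustration; I don't know whether tree implementations actually use it.

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / (1 + z**2 / n)

# The 2-sample leaf (2/2 in category 1) vs. its 10-sample parent (9/10):
print(wilson_lower_bound(2, 2))   # ~0.34: perfect purity, but a tiny sample
print(wilson_lower_bound(9, 10))  # ~0.60: the bigger node wins on confidence
```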

Please let me know if my question needs revising or clarification of any kind. Thanks.

Best Answer

Penalizing as you would in L1 regression is difficult to reconcile with the simple algorithms used for greedy tree growth; you would need some sort of global optimization of the tree, which becomes unwieldy.
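One nuance: classical cost-complexity pruning does impose a penalty of exactly this flavor, but it is applied after the tree has been grown greedily rather than during growth. As I recall the CART formulation (notation varies between texts):

```latex
% CART cost-complexity (weakest-link) pruning: among subtrees T of the
% fully grown tree, minimize training error R(T) plus a penalty on the
% number of leaves |T|, with alpha >= 0 chosen e.g. by cross-validation.
R_\alpha(T) = R(T) + \alpha \, |T|
```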

Pruning is a common approach in single-decision-tree analysis, but it is rarely used in ensemble-of-trees methods like Random Forest (which this question is tagged with). There, overfitting is instead combated by increasing the randomization of the individual models: via bagging, by decreasing the number of features examined at each split, or by reducing the number of candidate split points examined per tree, as in Extremely Randomized Trees.
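As a concrete sketch of pruning a single tree, scikit-learn exposes cost-complexity pruning through its ccp_alpha parameter; the synthetic dataset below mirrors the question's setup (3 features, 100 samples) but is otherwise an arbitrary illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's setup: 100 samples, 3 features.
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees get pruned away; larger alpha means a smaller tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_test, y_test):.2f}")
```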

You can also limit the complexity of the tree directly with a max-depth or leaf-size parameter.
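For instance, assuming scikit-learn again, a minimum leaf size of 10 on a 100-sample training set would rule out the 2-sample leaf from the question by construction (the thresholds here are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,          # no path may be deeper than 3 splits
    min_samples_leaf=10,  # every leaf must hold at least 10 training samples
)
tree.fit(X, y)
print(tree.get_n_leaves())  # bounded above by 2**max_depth = 8
```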

In a Random Forest model I would do a parameter sweep over the number of in-bag cases, the number of features examined at each split, and the leaf size or maximum tree depth, and see if that achieves what you want.
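A sketch of that sweep, again assuming scikit-learn's RandomForestClassifier; the grid values are illustrative guesses, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

param_grid = {
    "max_samples": [0.5, 0.8, None],  # number of in-bag cases per tree
    "max_features": [1, 2, 3],        # features examined at each split
    "min_samples_leaf": [1, 5, 10],   # leaf size
    "max_depth": [2, 4, None],        # maximum tree depth
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```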