Solved – Are tree estimators ALWAYS biased?


I'm doing a homework assignment on decision trees, and one of the questions I have to answer is "Why are estimators built out of trees biased, and how does bagging help reduce their variance?".

Now, I know that overfitted models tend to have really low bias, because they try to fit all the data points. I had a Python script that fitted a tree to a dataset with a single feature: just a sinusoid, with some points off the curve (picture below). So I wondered: "well, if I reeeeally overfit the data, can I get the bias to zero?". And it turned out that, even with a depth of 10000, there are still some points the curve doesn't pass through.

[Figure: a very deep decision tree fitted to the noisy sinusoid; the fitted curve still misses some of the training points.]
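In case it helps, here's a stripped-down sketch of what I ran (the sample size, noise level, and seed are placeholders, not my exact values):

```python
# Fit a very deep tree to a noisy sinusoid and check how close the
# fit gets to the training points.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))  # single feature
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # sinusoid + noise

tree = DecisionTreeRegressor(max_depth=10000).fit(X, y)
print("training MSE:", mean_squared_error(y, tree.predict(X)))
# With distinct x values, a fully grown tree can give each training
# point its own leaf, so training error can reach (numerically) zero;
# repeated x values with different y's make that impossible.
```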

I tried searching for why, but I couldn't really find an explanation. I'm guessing that maybe some trees would pass perfectly through all the points, and the ones I got were just "bad luck". Or that a different dataset could've given me an unbiased result (maybe a perfect sinusoid?). Or even that the cuts made at the beginning made it impossible for later cuts to fully separate all the points.

So, taking this dataset into consideration (since the answer might be different for others), my question is: is it possible to overfit a tree to the point where the bias goes to zero, or is there always gonna be some bias, even if it's really small? And if there's always at least some bias, why does that happen?

P.S. I don't know if it's relevant, but I used the DecisionTreeRegressor from sklearn to fit the model to the data.

Best Answer

A decision tree model is not always biased, any more than any other learning model is.

To illustrate, let's look at two examples. Let $X$ be a uniform random variable on $[0, 1]$. Here are two possible data-generating processes:

Truth 1: $Y$ given $X$ is an indicator function of $X$, plus noise:

$$ Y \mid X \sim I_{< .5}(X) + N(0, 1) $$

Truth 2: $Y$ given $X$ is a linear function of $X$, plus noise:

$$ Y \mid X \sim X + N(0, 1) $$

If we fit a decision tree in both situations, the model is unbiased in the first situation but biased in the second. This is because a single-split binary tree can recover the true underlying model in the first situation. In the second, the best a tree can do is approximate the linear function by stair-stepping at ever finer intervals; a tree of finite depth can only get so close. The sketch below makes this concrete.
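Here is a minimal numeric sketch of that last point (the grid size and the particular depths are arbitrary choices of mine): fit trees of increasing depth to a noiseless linear target and watch the approximation error shrink without reaching zero.

```python
# A depth-d tree approximates y = x by a piecewise-constant function
# with at most 2**d leaves, so some approximation error remains until
# every point gets its own leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 1, 1024).reshape(-1, 1)
y = X.ravel()  # noiseless linear target

for depth in (1, 2, 4, 8):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    err = np.abs(tree.predict(X) - y).max()
    print(f"depth {depth}: max |error| = {err:.4f}")
```

Each extra level roughly halves the error, but the stair-step shape never matches the line exactly; that remaining gap is the bias.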

If we fit a linear regression in both situations, the model is biased in the first situation but unbiased in the second.
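To make the bias visible directly, here is a hedged simulation sketch (the sample size, number of repetitions, and the choice of a depth-1 stump are mine, purely for illustration): average each model's predictions over many simulated training sets, then compare that average to the truth; the gap is the bias.

```python
# Estimate bias by Monte Carlo: average predictions over many training
# sets drawn from each truth, then compare with the true function.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 101).reshape(-1, 1)

truths = {
    "step":   lambda x: (x.ravel() < 0.5).astype(float),  # Truth 1
    "linear": lambda x: x.ravel(),                        # Truth 2
}

def mean_abs_bias(make_model, truth, n=200, reps=300):
    avg = np.zeros(len(grid))
    for _ in range(reps):
        X = rng.uniform(0, 1, size=(n, 1))
        y = truth(X) + rng.normal(0, 1, size=n)
        avg += make_model().fit(X, y).predict(grid)
    return np.abs(avg / reps - truth(grid)).mean()

for name, truth in truths.items():
    for make_model in (lambda: DecisionTreeRegressor(max_depth=1),
                       LinearRegression):
        print(f"{name:6s} {make_model().__class__.__name__:21s}"
              f" mean |bias| ~ {mean_abs_bias(make_model, truth):.3f}")
# (In finite samples the stump's split point is estimated, so a little
# bias remains near x = 0.5 under the step truth; it shrinks as n grows.)
```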

So, to know whether a model is biased, you need to know the true underlying data-generating mechanism. In real-life situations you never know this, so you can never really say whether a real-life model is biased or not. Sometimes we think we are totally right for a long time, and then the bias emerges with deeper understanding (the move from Newtonian gravity to Einsteinian gravity is at least an apocryphal example).

In some sense, we expect most real-world processes (with some exceptions) to be so unknowable that a reasonable approximation of the truth is that all our models are biased. I somehow doubt the question is asking for a deep philosophical discussion about the essential futility of modeling complex statistical processes, but it is fun to think about.