Solved – Are tree estimators ALWAYS biased?


I'm doing a homework assignment on decision trees, and one of the questions I have to answer is "Why are estimators built out of trees biased, and how does bagging help reduce their variance?".

Now, I know that overfitted models tend to have really low bias, because they try to fit all the data points. I had a Python script that fitted a tree to a dataset with a single feature: just a sinusoid, with some points off the curve (picture below). So I wondered: "well, if I reeeeally overfit the data, can I get the bias to zero?". And it turned out that, even with a depth of 10000, there are still some points the curve doesn't pass through.

[Figure: a very deep decision tree fitted to the noisy sinusoid; the fitted curve still misses some of the training points.]
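In case it helps, here's a stripped-down sketch of what I ran (the sample size, noise level, and seed are placeholders, not my exact values):

```python
# Fit a very deep tree to a noisy sinusoid and check how close the
# fit gets to the training points.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))  # single feature
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # sinusoid + noise

tree = DecisionTreeRegressor(max_depth=10000).fit(X, y)
print("training MSE:", mean_squared_error(y, tree.predict(X)))
# With distinct x values, a fully grown tree can give each training
# point its own leaf, so training error can reach (numerically) zero;
# repeated x values with different y's make that impossible.
```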

I tried searching for why, but I couldn't really find an explanation. I'm guessing that maybe some trees would pass perfectly through all the points, and the ones I got were just "bad luck". Or that a different dataset could've given me an unbiased result (maybe a perfect sinusoid?). Or even that the cuts made at the beginning made it impossible for later cuts to fully separate all the points.

So, taking this dataset into consideration (since the answer might be different for others), my question is: is it possible to overfit a tree to the point where the bias goes to zero, or is there always gonna be some bias, even if it's really small? And if there's always at least some bias, why does that happen?

P.S. I don't know if it's relevant, but I used the DecisionTreeRegressor from sklearn to fit the model to the data.

Best Answer

A decision tree model is not always biased, any more than any other learning model is.

To illustrate, let's look at two examples. Let $X$ be a uniform random variable on $[0, 1]$. Here are two possible data-generating processes:

Truth 1: $Y$ given $X$ is an indicator function of $X$, plus noise:

$$ Y \mid X \sim I_{< .5}(X) + N(0, 1) $$

Truth 2: $Y$ given $X$ is a linear function of $X$, plus noise:

$$ Y \mid X \sim X + N(0, 1) $$

If we fit a decision tree in both situations, the model is unbiased in the first situation but biased in the second. This is because a single-split binary tree can recover the true underlying model in the first situation. In the second, the best a tree can do is approximate the linear function by stair-stepping at ever finer intervals; a tree of finite depth can only get so close. The sketch below makes this concrete.
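Here is a minimal numeric sketch of that last point (the grid size and the particular depths are arbitrary choices of mine): fit trees of increasing depth to a noiseless linear target and watch the approximation error shrink without reaching zero.

```python
# A depth-d tree approximates y = x by a piecewise-constant function
# with at most 2**d leaves, so some approximation error remains until
# every point gets its own leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 1, 1024).reshape(-1, 1)
y = X.ravel()  # noiseless linear target

for depth in (1, 2, 4, 8):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    err = np.abs(tree.predict(X) - y).max()
    print(f"depth {depth}: max |error| = {err:.4f}")
```

Each extra level roughly halves the error, but the stair-step shape never matches the line exactly; that remaining gap is the bias.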

If we fit a linear regression in both situations, the model is biased in the first situation but unbiased in the second.
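To make the bias visible directly, here is a hedged simulation sketch (the sample size, number of repetitions, and the choice of a depth-1 stump are mine, purely for illustration): average each model's predictions over many simulated training sets, then compare that average to the truth; the gap is the bias.

```python
# Estimate bias by Monte Carlo: average predictions over many training
# sets drawn from each truth, then compare with the true function.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 101).reshape(-1, 1)

truths = {
    "step":   lambda x: (x.ravel() < 0.5).astype(float),  # Truth 1
    "linear": lambda x: x.ravel(),                        # Truth 2
}

def mean_abs_bias(make_model, truth, n=200, reps=300):
    avg = np.zeros(len(grid))
    for _ in range(reps):
        X = rng.uniform(0, 1, size=(n, 1))
        y = truth(X) + rng.normal(0, 1, size=n)
        avg += make_model().fit(X, y).predict(grid)
    return np.abs(avg / reps - truth(grid)).mean()

for name, truth in truths.items():
    for make_model in (lambda: DecisionTreeRegressor(max_depth=1),
                       LinearRegression):
        print(f"{name:6s} {make_model().__class__.__name__:21s}"
              f" mean |bias| ~ {mean_abs_bias(make_model, truth):.3f}")
# (In finite samples the stump's split point is estimated, so a little
# bias remains near x = 0.5 under the step truth; it shrinks as n grows.)
```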

So, to know whether a model is biased, you need to know the true underlying data-generating mechanism. In real-life situations you never know this, so you can never really say whether a real-life model is biased or not. Sometimes we think we are totally right for a long time, and then the bias emerges with deeper understanding (the move from Newtonian gravity to Einsteinian gravity is at least an apocryphal example).

In some sense, we expect most real-world processes (with some exceptions) to be so unknowable that a reasonable approximation of the truth is that all our models are biased. I somehow doubt the question is asking for a deep philosophical discussion about the essential futility of modeling complex statistical processes, but it is fun to think about.