Solved – “Proof?” of Bias/Variance trade-off

bias, model selection, proof, statistical-learning, variance

I'm unsatisfied with the "proof" that there is a trade-off between variance and bias in Hastie's text (The Elements of Statistical Learning). I also haven't found anything on this site that speaks to my particular concern.

In section 7.3, they show the following. Assume that the target $Y$ has the relationship $Y = f(X) + \varepsilon$ where $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma_\varepsilon^2$. Since the target has error in it, they call $\sigma_\varepsilon^2$ "irreducible error".

$\begin{aligned}
Err(x_0) &= E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] \\
&= \sigma_\varepsilon^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2 \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}
\end{aligned}$
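
(For completeness, this decomposition is just an expansion of the square: assuming $\varepsilon$ is independent of $\hat{f}(x_0)$,

$\begin{aligned}
E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big]
&= E\big[(f(x_0) + \varepsilon - \hat{f}(x_0))^2\big] \\
&= E[\varepsilon^2] + E\big[(f(x_0) - \hat{f}(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + \big[E\hat{f}(x_0) - f(x_0)\big]^2 + E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big]^2,
\end{aligned}$

where the cross terms vanish because $E(\varepsilon) = 0$ and $E\big[\hat{f}(x_0) - E\hat{f}(x_0)\big] = 0$.)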

Fine. Obviously, if $Err(x_0)$ were held constant, then an increase in bias would lead to a decrease in variance. But as we change the model, who is to say the error stays constant? Wouldn't a great model have low bias and low variance, whereas a horrible model could have high bias and high variance? So I see no evidence here that a more complex model must have lower bias but greater variance. Does this calculation actually prove anything?

I would like to know if there's a rigorous, mathematical argument showing that this trade-off is fundamental. Usually I just see intuitive arguments, or case studies in the context of one specific type of analysis (such as linear regression).

Best Answer

First, write the statement mathematically: define $\mathcal{F}$ as a function space, $\hat{f}_{n,\mathcal{F}} = \arg\min_{\hat{f}\in\mathcal{F}}\sum_{i=1}^n (y_i - \hat{f}(x_i))^2$ as the least-squares fit in $\mathcal{F}$, and $Bias^2(\hat{f}_{n,\mathcal{F}}(x_0)) = [E(\hat{f}_{n,\mathcal{F}}(x_0)) - f(x_0)]^2$ and $Variance(\hat{f}_{n,\mathcal{F}}(x_0)) = Var[\hat{f}_{n,\mathcal{F}}(x_0)]$ as you defined them, where the expectation and variance are taken over the training data.

You asked whether a more complex model must have lower bias but greater variance, which can be written as the statement: if $\mathcal{F}_1 \subset \mathcal{F}_2$, then $Bias^2(\hat{f}_{n, \mathcal{F}_1}(x_0)) \ge Bias^2(\hat{f}_{n, \mathcal{F}_2}(x_0))$ and $Variance(\hat{f}_{n, \mathcal{F}_1}(x_0)) \le Variance(\hat{f}_{n, \mathcal{F}_2}(x_0))$.

I can find a counterexample as follows: assume the true $f(x) = 1$ with $\sigma_\varepsilon = 0$, and consider $\mathcal{F}_1 = \{ax\}$, $\mathcal{F}_2 = \{ax+b\}$, with $n = 2$ training points. It can be computed that $\hat{f}_{2, \mathcal{F}_1}(x_0) = \frac{x_1 + x_2}{x_1^2 + x_2^2}x_0$ and $\hat{f}_{2, \mathcal{F}_2}(x_0) = 1$ (for $x_1 \neq x_2$). The second has zero bias and zero variance, both lower than those of the first.
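
(To spell out the computation: since $\sigma_\varepsilon = 0$, the training responses are $y_1 = y_2 = 1$, and the least-squares coefficient in $\mathcal{F}_1$ is

$\hat{a} = \arg\min_{a} \sum_{i=1}^{2} (1 - a x_i)^2 = \frac{x_1 + x_2}{x_1^2 + x_2^2},$

which depends on the random training inputs $x_1, x_2$, so $\hat{f}_{2,\mathcal{F}_1}(x_0) = \hat{a}x_0$ generally has nonzero bias and variance. $\mathcal{F}_2$, on the other hand, interpolates the two points $(x_1, 1)$ and $(x_2, 1)$ exactly whenever $x_1 \neq x_2$, giving $\hat{f}_{2,\mathcal{F}_2}(x_0) = 1$ regardless of the training set.)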

This shows that a more complex model may have both lower bias and lower variance.
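
A quick simulation makes this concrete. This is only a sketch under assumptions not stated above: the evaluation point $x_0 = 1.5$ and the uniform distribution of the training inputs on $[1, 2]$ are arbitrary choices; the argument only needs $x_1 \neq x_2$ almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = 1.5             # evaluation point (arbitrary choice)
n_reps = 100_000     # number of simulated training sets

f1_preds = np.empty(n_reps)   # predictions at x0 from F1 = {a*x}
f2_preds = np.empty(n_reps)   # predictions at x0 from F2 = {a*x + b}

for r in range(n_reps):
    x = rng.uniform(1.0, 2.0, size=2)   # two random training inputs (assumed design)
    y = np.ones(2)                      # true f(x) = 1, no noise

    # Least-squares fit in F1 = {a*x}: a_hat = sum(x*y) / sum(x^2)
    a1 = np.sum(x * y) / np.sum(x ** 2)
    f1_preds[r] = a1 * x0

    # Least-squares fit in F2 = {a*x + b}: the line through the two points
    a2, b2 = np.polyfit(x, y, deg=1)
    f2_preds[r] = a2 * x0 + b2

for name, preds in [("F1 = {ax}  ", f1_preds), ("F2 = {ax+b}", f2_preds)]:
    bias2 = (preds.mean() - 1.0) ** 2   # squared bias at x0
    var = preds.var()                   # variance at x0
    print(f"{name}: bias^2 = {bias2:.3e}, variance = {var:.3e}")
```

With this setup, the bias$^2$ and variance for $\mathcal{F}_2$ should come out at essentially machine precision, while those for $\mathcal{F}_1$ are strictly positive.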
