Bias-Variance Tradeoff – Why the Variance of the Error Term is Always 1

bias, self-study, variance

I'm reading Introduction to Statistical Learning. The relevant part is referenced here: Proof/Derivation of Residual Sum of Squares (Based on Introduction to Statistical Learning).

When the authors show graphs that illustrate the "Bias vs Variance Tradeoff" (as in Figure 2.12), ${\rm Var}(\varepsilon)$ is always $1$ (note the dashed lines in the figures).

The conditions on $\varepsilon$ are clarified elsewhere, as on page 16:

$\varepsilon$ is a random error term, which is independent of $X$ and has mean zero.

… and there is some explanation about going from "random error term" to "irreducible error":

However, even if it were possible to form a perfect estimate for $f$, so that our estimated response took the form $\hat{Y} = f(X)$, our prediction would still have some error in it! This is because $Y$ is also a function of $\varepsilon$, which, by definition, cannot be predicted using $X$. Therefore, variability associated with $\varepsilon$ also affects the accuracy of our predictions.

But I don't see anywhere, in the other SO questions or in the book, why ${\rm Var}(\varepsilon)$ is always $1$.

Is it because the mean is zero? I don't think so; I could describe a dataset with a mean of zero but a variance $\ne 1$.
Is it because, as described elsewhere, "the error term $\varepsilon$ is normally distributed"? I don't know enough about the normal distribution; is the variance of a normal distribution always equal to some fixed value?

EDIT

In looking for help in Wikipedia's MSE article, I expected to find a formula consistent with the "three fundamental quantities" (i.e., the variance, the bias, and the variance of the error terms), but I didn't. Can someone tell me why Wikipedia doesn't list the variance of the error terms?

$$\operatorname{MSE}(\hat{\theta})=\operatorname{Var}(\hat{\theta})+ \left(\operatorname{Bias}(\hat{\theta},\theta)\right)^2$$

Best Answer

It isn't because the mean is $0$ or because the error term is normally distributed. A normal distribution can have any positive variance, whatever its mean. In fact, the normal distribution is the only 'named' distribution where the mean and the variance are independent of each other (see: What is the most surprising characterization of the Gaussian (normal) distribution?).
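To make that concrete, here is a minimal sketch (my own illustration, not something from the book) that draws normal errors with mean zero and three different standard deviations. The helper name `simulate_errors`, the sample size, and the seed are all hypothetical choices.

```python
import random
import statistics

def simulate_errors(sigma, n=100_000, seed=0):
    """Draw n normal errors with mean 0 and standard deviation sigma."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.gauss(0.0, sigma) for _ in range(n)]

# Same mean (zero) in every case; only the spread changes.
for sigma in (0.5, 1.0, 3.0):
    eps = simulate_errors(sigma)
    print(
        f"true variance={sigma ** 2:.2f}  "
        f"sample mean={statistics.fmean(eps):+.3f}  "
        f"sample variance={statistics.pvariance(eps):.3f}"
    )
```

Each run should report a sample mean near zero and a sample variance near $\sigma^2$, so neither "mean zero" nor "normally distributed" pins the variance at $1$.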

More generally, my strong guess is that the purpose of setting the variance of the errors equal to $1$ is pedagogical. With ${\rm Var}(\varepsilon) = 1$, the variance of the error term is the unit of measurement in the figures, so everything else in them can be read relative to it.
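For reference, the decomposition that those figures illustrate is, as far as I recall, equation (2.7) in the book, where $x_0$ is a test point and $\hat f$ is the fitted model:

$$E\left(y_0 - \hat f(x_0)\right)^2 = {\rm Var}\left(\hat f(x_0)\right) + \left[{\rm Bias}\left(\hat f(x_0)\right)\right]^2 + {\rm Var}(\varepsilon)$$

The last term does not depend on the fitted model, so it appears as a flat line at height ${\rm Var}(\varepsilon)$. Any other positive value would simply move that line up or down.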

Regarding the Wikipedia article, be aware that the variance of $\hat\theta$ is a function of the variance of the error term, so ${\rm Var}(\hat\theta)$ does include ${\rm Var}(\varepsilon)$ (it's just out of sight).
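As a rough check of that last point, the sketch below refits a simple least-squares slope on many simulated datasets and compares the variance of the slope estimate with the textbook value $\sigma^2 / S_{xx}$. Everything here is an assumption made for illustration: the model $y = 2x + \varepsilon$, the fixed design, the helper name `slope_estimates`, and the number of replications.

```python
import random
import statistics

def slope_estimates(sigma, n_reps=4000, seed=0):
    """Refit a least-squares slope on n_reps simulated datasets y = 2x + eps."""
    rng = random.Random(seed)
    x = [i / 10 for i in range(20)]  # fixed design points (hypothetical)
    x_bar = statistics.fmean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    estimates = []
    for _ in range(n_reps):
        # New errors each replication, with Var(eps) = sigma^2.
        y = [2.0 * xi + rng.gauss(0.0, sigma) for xi in x]
        y_bar = statistics.fmean(y)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        estimates.append(sxy / sxx)  # ordinary least-squares slope
    return estimates, sxx

for sigma in (1.0, 2.0):
    est, sxx = slope_estimates(sigma)
    print(
        f"Var(eps)={sigma ** 2:.0f}: "
        f"simulated Var(slope)={statistics.pvariance(est):.4f}, "
        f"theory sigma^2/Sxx={sigma ** 2 / sxx:.4f}"
    )
```

The slope estimator is unbiased in this setup, so its MSE is just ${\rm Var}(\hat\theta)$, and that variance grows in proportion to ${\rm Var}(\varepsilon)$ even though ${\rm Var}(\varepsilon)$ never appears as a separate term in the Wikipedia formula.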
