Bias-Variance – Mathematical Intuition of the Bias-Variance Equation


I recently asked a question seeking a mathematical interpretation/intuition, geometric or otherwise, behind the elementary equation relating the mean, variance, and second moment: $E[X^2] = \text{Var}(X) + (E[X])^2$.

But now I'm curious about the superficially similar bias-variance tradeoff equation.

$$
\begin{aligned}
\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta}-\theta)^2\right]
&= E\left[(\hat{\theta} - E[\hat\theta])^2\right] + \left(E[\hat\theta] - \theta\right)^2\\
&= \text{Var}(\hat\theta) + \text{Bias}(\hat\theta,\theta)^2
\end{aligned}
$$
(formulas from Wikipedia)

To me there is a superficial similarity with the bias-variance tradeoff equation for regression: three squared terms, two of which sum to the third. Very Pythagorean looking. Is there a similar vector relationship, including orthogonality, underlying all of these terms? Or is there some other related mathematical interpretation that applies?

I am seeking a mathematical analogy with other mathematical objects that might shed light. I am not looking for the accuracy-precision analogy, which is well covered here. But if there are non-technical analogies between the bias-variance tradeoff and the much more basic mean-variance relation, those would be welcome too.

Best Answer

The similarity is more than superficial.

The "bias-variance tradeoff" can be interpreted as the Pythagorean Theorem applied to two perpendicular Euclidean vectors: the length of one is the standard deviation and the length of the other is the bias. The length of the hypotenuse is the root mean squared error.

A fundamental relationship

As a point of departure, consider this revealing calculation, valid for any random variable $X$ with a finite second moment and any real number $a$. Since the second moment is finite, $X$ has a finite mean $\mu=\mathbb{E}(X)$ for which $\mathbb{E}(X-\mu)=0$, whence

$$
\begin{aligned}
\mathbb{E}\left((X-a)^2\right) &= \mathbb{E}\left((X-\mu\,+\,\mu-a)^2\right) \\
&= \mathbb{E}\left((X-\mu)^2\right) + 2(\mu-a)\,\mathbb{E}(X-\mu) + (\mu-a)^2 \\
&= \operatorname{Var}(X) + (\mu-a)^2.
\end{aligned}\tag{1}
$$

This shows how the mean squared deviation between $X$ and any "baseline" value $a$ varies with $a$: it is a quadratic function of $a$ with a minimum at $\mu$, where the mean squared deviation is the variance of $X$.
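Because $(1)$ is an algebraic identity, it also holds exactly for the empirical distribution of any sample. Here is a minimal numerical sketch (my own illustration; the gamma distribution, seed, and baseline values are arbitrary choices) comparing the two sides for a few values of $a$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)  # any distribution with a finite second moment
mu, var = X.mean(), X.var()                          # empirical mean and variance

for a in (0.0, mu, 10.0):
    lhs = np.mean((X - a) ** 2)   # mean squared deviation from the baseline a
    rhs = var + (mu - a) ** 2     # Var(X) + (mu - a)^2, per relation (1)
    print(f"a = {a:6.3f}   E[(X-a)^2] = {lhs:8.3f}   Var(X)+(mu-a)^2 = {rhs:8.3f}")
```

Up to floating-point error the two columns agree for every choice of $a$, and both are minimized at $a = \mu$.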

The connection with estimators and bias

Any estimator $\hat \theta$ is a random variable because (by definition) it is a (measurable) function of random variables. Letting it play the role of $X$ in the preceding, and letting the estimand (the thing $\hat\theta$ is supposed to estimate) be $\theta$, we have

$$\operatorname{MSE}(\hat\theta) = \mathbb{E}((\hat\theta-\theta)^2) = \operatorname{Var}(\hat\theta) + (\mathbb{E}(\hat\theta)-\theta)^2.$$
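As a hedged simulation sketch (the normal data, the true $\theta$, and the deliberately shrunken estimator below are illustrative choices, not part of the original argument), one can watch this decomposition hold for two different estimators of a mean:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 5.0                        # the estimand (true mean); an illustrative choice
n, reps = 20, 200_000              # sample size per dataset and number of simulated datasets

samples = rng.normal(loc=theta, scale=2.0, size=(reps, n))
estimators = {
    "sample mean": samples.mean(axis=1),
    "shrunken mean": 0.8 * samples.mean(axis=1),   # deliberately biased toward 0
}

for name, theta_hat in estimators.items():
    mse = np.mean((theta_hat - theta) ** 2)        # E[(theta_hat - theta)^2]
    var = theta_hat.var()                          # Var(theta_hat)
    bias = theta_hat.mean() - theta                # E[theta_hat] - theta
    print(f"{name:13s}  MSE = {mse:.4f}   Var + Bias^2 = {var + bias ** 2:.4f}")
```

The unbiased sample mean puts all of its MSE into variance, while the shrunken estimator has lower variance but pays for it in squared bias; in both cases the MSE equals variance plus squared bias.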

Let's return to $(1)$ now that we have seen how the statement about bias+variance for an estimator is literally a case of $(1)$. The question seeks "mathematical analogies with mathematical objects." We can do more than that by showing that square-integrable random variables can naturally be made into a Euclidean space.

Mathematical background

In a very general sense, a random variable is a (measurable) real-valued function on a probability space $(\Omega, \mathfrak{S}, \mathbb{P})$. The set of such functions that are square integrable, which is often written $\mathcal{L}^2(\Omega)$ (with the given probability structure understood), almost is a Hilbert space. To make it into one, we have to conflate any two random variables $X$ and $Y$ which don't really differ in terms of integration: that is, we say $X$ and $Y$ are equivalent whenever

$$\mathbb{E}(|X-Y|^2) = \int_\Omega |X(\omega)-Y(\omega)|^2 d\mathbb{P}(\omega) = 0.$$

It's straightforward to check that this is a true equivalence relation: most importantly, when $X$ is equivalent to $Y$ and $Y$ is equivalent to $Z$, then necessarily $X$ will be equivalent to $Z$. We may therefore partition all square-integrable random variables into equivalence classes. These classes form the set $L^2(\Omega)$. Moreover, $L^2$ inherits the vector space structure of $\mathcal{L}^2$ defined by pointwise addition of values and pointwise scalar multiplication. On this vector space, the function

$$X \to \left(\int_\Omega |X(\omega)|^2 d\mathbb{P}(\omega)\right)^{1/2}=\sqrt{\mathbb{E}(|X|^2)}$$

is a norm, often written $||X||_2$. This norm makes $L^2(\Omega)$ into a Hilbert space. Think of a Hilbert space $\mathcal{H}$ as an "infinite-dimensional Euclidean space." Any finite-dimensional subspace $V\subset \mathcal{H}$ inherits the norm from $\mathcal{H}$, and $V$ with this norm is a Euclidean space: we can do Euclidean geometry in it.

Finally, we need one fact that is special to probability spaces (rather than general measure spaces): because $\mathbb{P}$ is a probability, it is bounded (by $1$), whence the constant functions $\omega\to a$ (for any fixed real number $a$) are square integrable random variables with finite norms.
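To make this less abstract, here is a small sketch on an assumed finite probability space (my own illustration, not part of the original answer), where random variables are just arrays of values, the $L^2$ norm is a probability-weighted Euclidean norm, and two variables that differ only on an outcome of probability $0$ are the same element of $L^2(\Omega)$:

```python
import numpy as np

# An assumed finite probability space: outcomes omega_1..omega_4 with these probabilities.
p = np.array([0.5, 0.3, 0.2, 0.0])        # the last outcome has probability 0

def norm2(X):
    """The L^2 norm  ||X||_2 = sqrt(E[X^2])  on this finite probability space."""
    return np.sqrt(np.sum(p * X ** 2))

X = np.array([1.0, -2.0, 3.0, 7.0])
Y = np.array([1.0, -2.0, 3.0, -99.0])      # differs from X only on the null outcome

print(norm2(X), norm2(Y))                  # equal norms
print(norm2(X - Y))                        # 0.0: X and Y are the same element of L^2(Omega)
```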

A geometric interpretation

Consider any square-integrable random variable $X$, thought of as a representative of its equivalence class in $L^2(\Omega)$. It has a mean $\mu=\mathbb{E}(X)$ which (as one can check) depends only on the equivalence class of $X$. Let $\mathbf{1}:\omega\to 1$ be the class of the constant random variable.

$X$ and $\mathbf{1}$ generate a Euclidean subspace $V\subset L^2(\Omega)$ whose dimension is at most $2$. In this subspace, $||X||_2^2 = \mathbb{E}(X^2)$ is the squared length of $X$ and $||a\,\mathbf{1}||_2^2 = a^2$ is the squared length of the constant random variable $\omega\to a$. It is fundamental that $X-\mu\mathbf{1}$ is perpendicular to $\mathbf{1}$: the inner product associated with the norm is $\langle X, Y\rangle = \mathbb{E}(XY)$, and $\langle X-\mu\mathbf{1},\,\mathbf{1}\rangle = \mathbb{E}(X-\mu) = 0$. (One definition of $\mu$ is that it is the unique number for which this is the case.) Relation $(1)$ may be written

$$||X - a\mathbf{1}||_2^2 = ||X - \mu\mathbf{1}||_2^2 + ||(a-\mu)\mathbf{1}||_2^2.$$

It indeed is precisely the Pythagorean Theorem, in essentially the same form known 2500 years ago. The object $$X-a\mathbf{1} = (X-\mu\mathbf{1})-(a-\mu)\mathbf{1}$$ is the hypotenuse of a right triangle with legs $X-\mu\mathbf{1}$ and $(a-\mu)\mathbf{1}$.
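Continuing the finite-space sketch above (again with assumed, arbitrary numbers), one can check both the orthogonality of the legs and the Pythagorean identity directly, using the inner product $\langle U, V\rangle = \mathbb{E}(UV)$:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])             # a small finite probability space
X = np.array([1.0, -2.0, 3.0])            # a random variable, given by its values
one = np.ones_like(X)                     # the constant random variable 1

def inner(U, V):
    """Inner product <U, V> = E[UV] associated with the L^2 norm."""
    return np.sum(p * U * V)

mu = inner(X, one)                        # E[X]
a = 4.0                                   # an arbitrary baseline value

print(inner(X - mu * one, one))           # ~0: the leg X - mu*1 is perpendicular to 1
hyp2 = inner(X - a * one, X - a * one)    # ||X - a 1||^2, the squared hypotenuse
legs2 = inner(X - mu * one, X - mu * one) + (a - mu) ** 2   # sum of the squared legs
print(hyp2, legs2)                        # equal: relation (1) as the Pythagorean Theorem
```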

If you would like mathematical analogies, then, you may use anything that can be expressed in terms of the hypotenuse of a right triangle in a Euclidean space. The hypotenuse will represent the "error" and the legs will represent the bias and the deviations from the mean.