Solved – Relation between MSE and Bias-Variance

bias-variance tradeoff, mse

If MSE is

$$
\mathrm{MSE}(\hat Y) = \mathrm{Var}(\hat Y) + \mathrm{Bias}^2(\hat Y)
$$

and the Bias-Variance decomposition is given by

$$
\mathrm{Err}(\hat Y\,|\,X=x_0) = \mathrm{Var}(\hat Y) + \mathrm{Bias}^2(\hat Y) + \sigma^2
$$

then it looks like they're related, the second one being conditional on a single point $x_0$ while the first one is an overall mean. But it's not clear how to get from one to the other. So what is the relation?

(I have copied the equations from different sources, and there are slight notational differences, e.g. $\mathrm{Bias}(\hat Y)$ vs $\mathrm{Bias}(\hat Y, Y)$, and it's not clear to me if these are equivalent or not, so please point out/correct any misuse of notation.)

Best Answer

The two formulas express the bias-variance trade-off in two different (though related) contexts. It is confusing that the letter $Y$ is used in both cases even though it stands for different things; for the first case, I'm going to use the letter $F$ instead.

For an estimator

$$MSE(\hat F)=Var(\hat F)+Bias^2(\hat F)$$

In this case, you have a model with a parameter $\theta$ that determines the distribution of your data. You want to make an inference about this parameter, more precisely about a quantity $F$ that is a function of $\theta$. For this you have an estimator $\hat F$, which is a function of your data. Then the formula is:

$$E_\theta\bigl((F-\hat F)^2\bigr)=V_\theta(\hat F)+\bigl(E_\theta(\hat F-F)\bigr)^2$$

This formula gives you the MSE of your estimator, or, if you prefer, it measures the quality of the estimator in terms of squared distance to what it is supposed to estimate.
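For completeness, the decomposition follows by adding and subtracting $E_\theta(\hat F)$ inside the square; since $F$ is a constant under $E_\theta$, the cross term vanishes:

$$
E_\theta\bigl((F-\hat F)^2\bigr)
= \bigl(F-E_\theta(\hat F)\bigr)^2
+ E_\theta\Bigl(\bigl(E_\theta(\hat F)-\hat F\bigr)^2\Bigr)
+ 2\bigl(F-E_\theta(\hat F)\bigr)\underbrace{E_\theta\bigl(E_\theta(\hat F)-\hat F\bigr)}_{=\,0}
= \mathrm{Bias}^2(\hat F)+V_\theta(\hat F)
$$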

For a predictor

$$MSE(\hat Y|X=x_0)=Var(\hat Y)+Bias^2(\hat Y)+\sigma^2$$

In this case you have a model with additive noise. $X$ is the input, $Y$ is the output, and the model is $Y=f_\theta(X)+\epsilon$, where the noise $\epsilon$ is assumed to have mean 0 and known variance $\sigma^2$.

You had some data to learn from, and now you take a new $x_0$ and want to predict the unknown outcome $y_0$. Since the noise has mean 0, it is natural to use $f_\theta(x_0)$ as a guess for $y_0$. You don't know $\theta$, and thus you don't know the value of $f_\theta(x_0)$, so you need an estimator of it, written $\hat f(x_0)$. Applying the previous formula to this estimator (with $F=f_\theta(x_0)$), you know that:

$$E_\theta\bigl((f(x_0)-\hat f(x_0))^2\bigr)=V_\theta(\hat f(x_0))+\bigl(E_\theta(\hat f(x_0)-f(x_0))\bigr)^2$$

On the other hand, because of the additive noise:

$$E_\theta((f(x_0)-y_0)^2)=\sigma^2$$

Assuming the noise is independent of your training data, you can finally combine the two formulas (I'm skipping the technical details, but see the sketch below):

$$E_\theta\bigl((y_0-\hat f(x_0))^2\bigr)=V_\theta(\hat f(x_0))+\bigl(E_\theta(\hat f(x_0)-f(x_0))\bigr)^2+\sigma^2$$
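The skipped details amount to one more add-and-subtract, this time of $f(x_0)$ inside the square:

$$
E_\theta\bigl((y_0-\hat f(x_0))^2\bigr)
= \underbrace{E_\theta\bigl((y_0-f(x_0))^2\bigr)}_{\sigma^2}
+ E_\theta\bigl((f(x_0)-\hat f(x_0))^2\bigr)
+ 2\,E_\theta\bigl((y_0-f(x_0))(f(x_0)-\hat f(x_0))\bigr)
$$

The cross term is zero because $y_0-f(x_0)=\epsilon$ has mean 0 and is independent of $\hat f(x_0)$ (which depends only on the training data), and the middle term is exactly the estimator MSE decomposed above.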

In a less formal way:

  • take an $x_0$ for which you want to predict the outcome $y_0$
  • you will predict it with the predictor $\hat Y=\hat f(x_0)$
  • you make two errors: the first comes from the imprecision of your estimator $\hat f(x_0)$ of $f(x_0)$, the second comes from the noise
  • they sum up because the noise is additive (and independent); the small simulation below illustrates this numerically
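To make the "they sum up" claim concrete, here is a small simulation sketch (not part of the original answer; the true function, the straight-line estimator, and all constants are illustrative choices) checking numerically that $V_\theta(\hat f(x_0))+\mathrm{Bias}^2(\hat f(x_0))+\sigma^2$ matches the empirical prediction MSE at a fixed $x_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: true regression function, noise level, and query point.
def f_true(x):
    return np.sin(x)

sigma = 0.3          # noise standard deviation (sigma**2 is the irreducible error)
x0 = 1.0             # the fixed point where we predict
n_train = 20         # training-set size
n_repeats = 20000    # number of independent training sets / predictions

preds = np.empty(n_repeats)   # \hat f(x0) for each training set
sq_err = np.empty(n_repeats)  # (y0 - \hat f(x0))^2 for each repetition

for i in range(n_repeats):
    # Draw a fresh training set from the model Y = f(X) + eps.
    x_tr = rng.uniform(0, 2, n_train)
    y_tr = f_true(x_tr) + rng.normal(0, sigma, n_train)

    # A deliberately biased estimator: fit a straight line to data from a sine curve.
    slope, intercept = np.polyfit(x_tr, y_tr, 1)
    preds[i] = slope * x0 + intercept

    # Draw an independent outcome y0 at x0 and record the squared prediction error.
    y0 = f_true(x0) + rng.normal(0, sigma)
    sq_err[i] = (y0 - preds[i]) ** 2

variance = preds.var()
bias_sq = (preds.mean() - f_true(x0)) ** 2
mse_pred = sq_err.mean()

print(f"Var(f_hat(x0))      = {variance:.4f}")
print(f"Bias^2(f_hat(x0))   = {bias_sq:.4f}")
print(f"sigma^2             = {sigma**2:.4f}")
print(f"sum of the three    = {variance + bias_sq + sigma**2:.4f}")
print(f"empirical MSE at x0 = {mse_pred:.4f}")
```

The last two printed numbers should agree up to Monte Carlo error, which is exactly the combined formula above evaluated empirically.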