Solved – Bias / variance tradeoff math

Tags: bias-variance-tradeoff, mse, unbiased-estimator

I understand the matter in terms of underfitting/overfitting, but I still struggle to grasp the exact math behind it. I've checked several sources (here, here, here, here and here), but I still don't see why exactly bias and variance oppose each other the way, e.g., $e^x$ and $e^{-x}$ do:


[figure: bias and variance curves moving in opposite directions (source)]

It seems like everybody derives the following equation (omitting the irreducible error $\epsilon$ here)
$$\newcommand{\var}{{\rm Var}}
E[(\hat{\theta}_n - \theta)^2]=E[(\hat{\theta}_n - E[\hat{\theta}_n])^2] + (E[\hat{\theta}_n - \theta])^2
$$
and then, instead of driving the point home and showing exactly why the two terms on the right behave the way they do, wanders off into the imperfections of this world and how impossible it is to be both precise and universal at the same time.
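
For reference, the identity itself follows by adding and subtracting $E[\hat{\theta}_n]$ inside the square; the cross term vanishes because $E[\hat{\theta}_n - E[\hat{\theta}_n]] = 0$:
$$
E[(\hat{\theta}_n - \theta)^2]
= E\left[\left((\hat{\theta}_n - E[\hat{\theta}_n]) + (E[\hat{\theta}_n] - \theta)\right)^2\right]
= E[(\hat{\theta}_n - E[\hat{\theta}_n])^2] + (E[\hat{\theta}_n] - \theta)^2,
$$
and $(E[\hat{\theta}_n] - \theta)^2 = (E[\hat{\theta}_n - \theta])^2$ since $\theta$ is a constant.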

The obvious counterexample

Say a population mean $\mu$ is estimated by the sample mean $\bar{X}_n = \frac{1}{n}\sum\limits_{i=1}^{n}X_i$ of an i.i.d. sample, i.e. $\theta\equiv\mu$ and $\hat{\theta}_n\equiv\bar{X}_n$; then:
$$MSE = \var(\bar{X}_n - \mu) + (E[\bar{X}_n] - \mu)^2 $$
since $E[\bar{X}_n]=\mu$ and $\var(\mu) = 0$, we have:
$$MSE = \var(\bar{X}_n) = \frac{1}{n}\var(X)\xrightarrow[n\to\infty]{}0$$
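
As a quick numerical check of this limit, here is a minimal simulation sketch (numpy, with an arbitrary normal population and illustrative sample sizes; none of these choices come from the sources above):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma = 3.0, 2.0                 # illustrative population parameters
n_reps = 5000                             # number of repeated samples per n
for n in (10, 100, 1000):
    # draw many independent samples of size n and compute each sample mean
    means = rng.normal(true_mu, sigma, size=(n_reps, n)).mean(axis=1)
    mse = np.mean((means - true_mu) ** 2)
    print(f"n={n:5d}  empirical MSE={mse:.5f}  Var(X)/n={sigma**2 / n:.5f}")
```

The empirical MSE tracks $\var(X)/n$ and shrinks toward zero as $n$ grows, exactly as the formula says.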

So, the questions are:

  1. Why exactly can't $E[(\hat{\theta}_n - E[\hat{\theta}_n])^2]$ and $E[\hat{\theta}_n - \theta]$ be decreased simultaneously?
  2. Why can't we just take some unbiased estimator and reduce the variance by increasing the sample size?

Best Answer

First, nobody says that squared bias and variance behave exactly like $e^{\pm x}$, in case you are wondering. The point is simply that one increases while the other decreases. It's similar to supply and demand curves in microeconomics, which are traditionally drawn as straight lines, something that also sometimes confuses people. Again, the point is simply that one slopes downward and the other upward.

Your key confusion is about what is on the horizontal axis. It's model complexity - not sample size. Yes, as you write, if we use some unbiased estimator, then increasing the sample size will reduce its variance, and we will get a better model. However, the bias-variance tradeoff is in the context of a fixed sample size, and what we vary is the model complexity, e.g., by adding predictors.
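
To see that tradeoff numerically, here is a minimal sketch of the usual experiment, under assumptions made up purely for illustration (a sine target, Gaussian noise, polynomial fits of varying degree, fixed $n$); it is not taken from any linked source:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_reps = 30, 2000                   # sample size is held FIXED; only complexity varies
f = lambda x: np.sin(np.pi * x)        # illustrative "true" regression function
x0 = 0.5                               # point at which bias and variance are measured

for degree in (1, 3, 5, 9):            # model complexity = polynomial degree
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(-1, 1, n)
        y = f(x) + rng.normal(0, 0.3, n)
        coefs = np.polyfit(x, y, degree)      # fit a polynomial of the given degree
        preds[r] = np.polyval(coefs, x0)      # prediction at x0 for this sample
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"degree={degree}  bias^2={bias2:.4f}  variance={var:.4f}")
```

With $n$ fixed, low degrees give large squared bias and small variance, and high degrees the reverse; their sum (the MSE at $x_0$) is smallest somewhere in between.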

If model A is too small and omits some predictors whose true parameter values are nonzero, while model B encompasses model A and contains all predictors with nonzero parameter values, then the parameter estimates from model A will be biased and those from model B unbiased; but the variance of the parameter estimates in model A will be smaller than the variance of the corresponding estimates in model B.
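
A small fixed-design simulation of exactly this A-versus-B comparison (the coefficient values, the correlation between predictors, and the sample size below are illustrative assumptions, not anything from the answer above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_reps = 50, 5000
beta0, beta1, beta2 = 0.0, 1.0, 2.0          # both slopes are truly nonzero

# Fixed design: predictors are drawn once and held fixed; only the noise is redrawn,
# so the comparison is "same data, different model size".
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.5 * rng.normal(size=n)     # x2 strongly correlated with x1
XA = np.column_stack([np.ones(n), x1])       # model A: omits x2
XB = np.column_stack([np.ones(n), x1, x2])   # model B: contains both predictors

est_A, est_B = [], []
for _ in range(n_reps):
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    est_A.append(np.linalg.lstsq(XA, y, rcond=None)[0][1])  # estimate of beta1 in A
    est_B.append(np.linalg.lstsq(XB, y, rcond=None)[0][1])  # estimate of beta1 in B

for name, est in (("A (x1 only)  ", est_A), ("B (x1 and x2)", est_B)):
    est = np.asarray(est)
    print(f"model {name}  mean = {est.mean():.3f} (true beta1 = {beta1}),"
          f" variance = {est.var():.5f}")
```

Model A's estimate of the coefficient on x1 centres away from the true value (bias) but varies less across replications, while model B's is centred correctly with larger variance, matching the statement above.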