Bias Variance – Philosophical Insights into the Bias-Variance Decomposition

Tags: bias, bias-variance tradeoff, estimators, loss-functions, variance

As we know, we can perform a bias-variance decomposition of an estimator with MSE as the loss function, and it looks like this:

$$\operatorname{MSE}(\hat{\theta}) = \operatorname{tr}(\operatorname{Var}[\hat{\theta}]) + (\|{\operatorname{Bias}[\hat{\theta}]}\|)^2$$
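(As a quick reminder of where these two terms come from, the scalar version of the identity follows by adding and subtracting $\mathbb{E}[\hat{\theta}]$ inside the squared error; the cross term has mean zero, and the multivariate case gives the trace form above.)

$$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 = \operatorname{Var}[\hat{\theta}] + \operatorname{Bias}[\hat{\theta}]^2$$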

Similarly, if we perform a bias-variance decomposition of a predictor with MSE as the loss function, it looks like:

$$\operatorname{MSE}(\hat{y}\mid X) = \operatorname{Var}[\hat{y}] + (\|{\operatorname{Bias}[\hat{y}]}\|)^2 + \sigma_{\varepsilon}^2 $$
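(To see this version numerically, here is a minimal simulation sketch; the sine target, cubic polynomial fit, and all constants are illustrative choices rather than anything from the question. It refits a model on many fresh training sets drawn from $y = f(x) + \varepsilon$ and checks that the Monte Carlo MSE at a fixed test point matches $\operatorname{Var}[\hat{y}] + \operatorname{Bias}[\hat{y}]^2 + \sigma_\varepsilon^2$.)

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True regression function (known here only because we are simulating)."""
    return np.sin(2 * np.pi * x)

sigma_eps = 0.3          # irreducible noise standard deviation
x_test = 0.35            # the fixed test input X
n_train, n_reps, degree = 30, 5000, 3

preds = np.empty(n_reps)
for r in range(n_reps):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma_eps, n_train)
    coefs = np.polyfit(x, y, degree)        # refit the predictor on a fresh training set
    preds[r] = np.polyval(coefs, x_test)    # its prediction y-hat at the test point

var_term = preds.var()
bias_sq = (preds.mean() - f(x_test)) ** 2
y_new = f(x_test) + rng.normal(0, sigma_eps, n_reps)   # fresh test responses
mse = np.mean((preds - y_new) ** 2)

print(f"Var + Bias^2 + sigma_eps^2 = {var_term + bias_sq + sigma_eps**2:.4f}")
print(f"Monte Carlo MSE            = {mse:.4f}")  # agrees up to simulation noise
```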

I am curious about the philosophy behind breaking an estimator or a predictor down into a variance term and a bias term. Why not some other terms?
It is more of a broad question of why we think of decomposing estimators and predictors in this form.

Just thinking aloud: we could break a predictor into a known distribution plus an error term, or an estimator into a known distribution of the sample plus an error term.

Please do correct me if I have some misunderstanding in terms of my thought process.

The paper that triggered this question in my head (a bit unrelated): https://faculty.wharton.upenn.edu/wp-content/uploads/2012/04/Strong.pdf

Updated:

  • Edit 1: Predictor Error with $\sigma_\varepsilon^2$
  • Edit 2: Updated reference paper

Best Answer

Bias and variance are elementary properties of estimators. They are usually introduced to early statistics students because they are well understood conceptually, and because one can study the properties of the quite restricted class of unbiased estimators: the Cramér-Rao bound, sufficiency, asymptotic relative efficiency, and so on.

The fact that squared bias and variance arise as a decomposition of squared error is, if anything, an elegant result. The questions that immediately follow are too numerous to count: is squared error loss the right loss? Under what conditions is it optimal? Does a similar result exist for other loss functions? To paraphrase Paul Erdős, "Anyone can think of an interesting problem."

The first-year theory approach has its problems too. Consider Hodges' superefficient estimator: at $\theta = 0$ its asymptotic variance beats the Cramér-Rao bound. But it turns out that the estimator is not regular.
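A minimal sketch of that pathology, assuming the usual textbook form of Hodges' estimator $\hat{\theta} = \bar{X}\,\mathbb{1}\{|\bar{X}| \ge n^{-1/4}\}$ for $X_i \sim N(\theta, 1)$ (the sample size and $\theta$ grid below are arbitrary choices): the scaled risk $n\,\mathbb{E}[(\hat{\theta} - \theta)^2]$ drops below the Cramér-Rao value of 1 at $\theta = 0$, but spikes at nearby $\theta$, which is the non-regularity in action.

```python
import numpy as np

rng = np.random.default_rng(1)

def hodges(xbar, n):
    """Hodges' estimator: report the sample mean unless it is small, then report 0."""
    return np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)

def scaled_risk(theta, n, reps=200_000):
    """Monte Carlo estimate of n * E[(theta_hat - theta)^2] for X_i ~ N(theta, 1)."""
    xbar = theta + rng.normal(0.0, 1.0, reps) / np.sqrt(n)   # X-bar ~ N(theta, 1/n)
    return n * np.mean((hodges(xbar, n) - theta) ** 2)

n = 1000
for theta in [0.0, 0.05, 0.10, 0.50]:
    # For the sample mean itself, the scaled risk is exactly 1 (the Cramér-Rao value).
    print(f"theta = {theta:.2f}: n * risk = {scaled_risk(theta, n):7.3f}")
```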

More broadly, once we start considering biased estimators, we have a much broader class of estimators with different optimality properties to consider. Concepts like admissibility, minimaxity, and penalized or bounded loss give rise to other popular estimators as solutions to particular problems, notably Bayes estimators, ridge estimators, and so on. These concepts would be covered in a second-year statistics or probability theory class, from texts such as Ferguson's "A Course in Large Sample Theory", Lehmann and Casella's "Theory of Point Estimation", or Wasserman's "All of Nonparametric Statistics".
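As a small numerical illustration of why biased estimators are worth the trouble, here is a sketch comparing OLS with ridge on a deliberately collinear design (the design, true coefficients, and penalty $\lambda = 5$ are arbitrary choices of mine): ridge accepts a little bias in exchange for a large reduction in $\operatorname{tr}(\operatorname{Var})$, and so wins on the total MSE from the first equation above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: strongly correlated predictors make OLS high-variance.
n, p, lam, sigma = 50, 10, 5.0, 1.0
beta = np.ones(p)                                   # true coefficients
shared = rng.normal(size=(n, 1))
X = 0.9 * shared + 0.1 * rng.normal(size=(n, p))    # collinear design, held fixed

def fit(y, lam):
    """Closed-form ridge solution; lam = 0 recovers OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

reps = 2000
ests = {"OLS": [], "ridge": []}
for _ in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)        # fresh noise, same design
    ests["OLS"].append(fit(y, 0.0))
    ests["ridge"].append(fit(y, lam))

for name, est in ests.items():
    est = np.array(est)
    bias_sq = np.sum((est.mean(axis=0) - beta) ** 2)   # ||Bias||^2
    tr_var = np.sum(est.var(axis=0))                    # tr(Var)
    print(f"{name:>5}: Bias^2 = {bias_sq:8.3f}  tr(Var) = {tr_var:8.3f}  MSE = {bias_sq + tr_var:8.3f}")
```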