Solved – Bias-variance decomposition

bias, loss-functions, regularization, self-study, variance

In Section 3.2 of Pattern Recognition and Machine Learning, Bishop discusses the bias-variance decomposition, stating that for a squared loss function the expected loss can be decomposed into a squared bias term (which describes how far the average predictions are from the true model), a variance term (which describes the spread of the predictions around the average), and a noise term (which gives the intrinsic noise of the data).

  1. Can bias-variance decomposition be performed with loss functions other than squared loss?
  2. For a given dataset, is there more than one model whose expected loss is the minimum over all models, and if so, does that mean that there could be different combinations of bias and variance that yield the same minimum expected loss?
  3. If a model involves regularization, is there a mathematical relationship between bias, variance, and the regularization coefficient $\lambda$?
  4. How can you calculate bias if you don't know the true model?
  5. Are there situations in which it makes more sense to minimize bias or variance rather than expected loss (the sum of squared bias and variance)?

Best Answer

...the expected [squared error] loss can be decomposed into a squared bias term (which describes how far the average predictions are from the true model), a variance term (which describes the spread of the predictions around the average), and a noise term (which gives the intrinsic noise of the data).

When looking at the squared error loss decomposition $$\mathbb{E}_\theta[(\theta-\delta(X_{1:n}))^2]=(\theta-\mathbb{E}_\theta[\delta(X_{1:n})])^2+\mathbb{E}_\theta[(\mathbb{E}_\theta[\delta(X_{1:n})]-\delta(X_{1:n}))^2]$$ I only see two terms: one for the squared bias and one for the variance of the estimator or predictor, $\delta(X_{1:n})$. There is no additional noise term in the expected loss, as it should be, since the variability here is the variability of $\delta(X_{1:n})$, not of the sample itself.
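As a quick numerical check of this identity, here is a minimal Monte Carlo sketch; the shrinkage estimator $\delta(X_{1:n})=c\,\bar X_n$, the sample size, and the constants are illustrative assumptions, not anything fixed by the question.

```python
import numpy as np

# Minimal Monte Carlo check of risk = bias^2 + variance for the (assumed)
# shrinkage estimator delta(X_{1:n}) = c * mean(X) under X_i ~ N(theta, sigma^2).
rng = np.random.default_rng(0)
theta, sigma, n, c = 2.0, 1.0, 10, 0.8
reps = 200_000

samples = rng.normal(theta, sigma, size=(reps, n))
delta = c * samples.mean(axis=1)          # one value of delta(X_{1:n}) per replication

risk = np.mean((theta - delta) ** 2)      # E_theta[(theta - delta)^2]
bias_sq = (theta - delta.mean()) ** 2     # (theta - E_theta[delta])^2
variance = delta.var()                    # E_theta[(E_theta[delta] - delta)^2]

print(f"risk         = {risk:.4f}")
print(f"bias^2 + var = {bias_sq + variance:.4f}")  # agrees up to Monte Carlo error
```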

  1. Can bias-variance decomposition be performed with loss functions other than squared loss?

My interpretation of the squared bias+variance decomposition [and the way I teach it] is that it is the statistical equivalent of Pythagoras' theorem: the squared distance between an estimator and a point in a given set equals the squared distance from the estimator to the set (that is, to its orthogonal projection onto the set), plus the squared distance from that orthogonal projection to the point in the set. Any loss based on a distance that comes with a notion of orthogonal projection, i.e., an inner product, i.e., essentially a Hilbert space, satisfies this decomposition.
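In the squared-error case this orthogonality is just the vanishing of the cross term when the error is split at $\mathbb{E}_\theta[\delta(X_{1:n})]$:
$$\mathbb{E}_\theta\big[(\theta-\delta(X_{1:n}))^2\big]
=\big(\theta-\mathbb{E}_\theta[\delta(X_{1:n})]\big)^2
+\mathbb{E}_\theta\big[(\mathbb{E}_\theta[\delta(X_{1:n})]-\delta(X_{1:n}))^2\big]
+2\big(\theta-\mathbb{E}_\theta[\delta(X_{1:n})]\big)\,\mathbb{E}_\theta\big[\mathbb{E}_\theta[\delta(X_{1:n})]-\delta(X_{1:n})\big],$$
and the last expectation is zero by definition of $\mathbb{E}_\theta[\delta(X_{1:n})]$. In a general Hilbert space, the same cancellation holds with the orthogonal projection playing the role of the expectation.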

  2. For a given dataset, is there more than one model whose expected loss is the minimum over all models, and if so, does that mean that there could be different combinations of bias and variance that yield the same minimum expected loss?

The question is unclear: if by minimum over models, you mean $$\min_\theta \mathbb{E}_\theta[(\theta-\delta(X_{1:n}))^2]$$ then there are many examples of statistical models and associated decisions with a constant expected loss (or risk). Take for instance the MLE of a Normal mean.
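To spell out that example (with the variance $\sigma^2$ assumed known for simplicity): for $X_i\sim\mathcal{N}(\theta,\sigma^2)$ the MLE is $\delta(X_{1:n})=\bar X_n$, which is unbiased with variance $\sigma^2/n$, so
$$\mathbb{E}_\theta[(\theta-\bar X_n)^2]=0+\frac{\sigma^2}{n}\qquad\text{for every }\theta,$$
a risk that does not depend on $\theta$.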

  4. How can you calculate bias if you don't know the true model?

In a generic sense, the bias is the distance between the true model and the closest model within the assumed family of distributions. If the true model is unknown, the bias can be assessed by the bootstrap.
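As a rough sketch of the bootstrap route (the choice of statistic, the sample size, and the number of resamples below are arbitrary, for illustration only): the bias is approximated by the average of the bootstrap replications of the estimator minus its value on the original sample.

```python
import numpy as np

# Bootstrap estimate of the bias of a statistic; illustrative choice: the
# plug-in variance estimator (ddof=0), whose bias is known to be -sigma^2/n.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=30)   # observed sample (simulated here)

def statistic(sample):
    return sample.var()                       # ddof=0: the biased plug-in estimator

theta_hat = statistic(x)
boot = np.array([statistic(rng.choice(x, size=x.size, replace=True))
                 for _ in range(5000)])

bias_hat = boot.mean() - theta_hat            # bootstrap bias estimate
print(f"estimate       = {theta_hat:.3f}")
print(f"bootstrap bias = {bias_hat:.3f}")
print(f"bias-corrected = {theta_hat - bias_hat:.3f}")
```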

  5. Are there situations in which it makes more sense to minimize bias or variance rather than expected loss (the sum of squared bias and variance)?

When considering another loss function such as $$(\theta-\mathbb{E}_\theta[\delta(X_{1:n})])^2+\alpha\,(\mathbb{E}_\theta[\delta(X_{1:n})]-\delta(X_{1:n}))^2\qquad \alpha>0$$ pushing $\alpha$ towards zero puts most of the weight on the bias, while pushing $\alpha$ towards infinity shifts the focus to the variance.
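Within the shrinkage family $\delta_c(X_{1:n})=c\,\bar X_n$ of the Normal-mean example (an illustrative choice), the expected value of this weighted criterion is $(1-c)^2\theta^2+\alpha\,c^2\sigma^2/n$, minimized at
$$c^\star=\frac{\theta^2}{\theta^2+\alpha\,\sigma^2/n},$$
so $\alpha\to 0$ selects the unbiased MLE ($c^\star\to 1$) while $\alpha\to\infty$ selects the zero-variance constant estimator ($c^\star\to 0$).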