...the expected [squared error] loss can be decomposed into a squared
bias term (which describes how far the average predictions are from
the true model), a variance term (which describes the spread of the
predictions around the average), and a noise term (which gives the
intrinsic noise of the data).
When looking at the squared error loss decomposition
$$\mathbb{E}_\theta[(\theta-\delta(X_{1:n}))^2]=(\theta-\mathbb{E}_\theta[\delta(X_{1:n})])^2+\mathbb{E}_\theta[(\mathbb{E}_\theta[\delta(X_{1:n})]-\delta(X_{1:n}))^2]$$
I only see two terms: one for the squared bias and another for the variance of the estimator or predictor, $\delta(X_{1:n})$. There is no additional noise term in the expected loss, as it should be: the variability involved is that of $\delta(X_{1:n})$, not of the sample itself.
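The two-term structure follows by adding and subtracting $\mathbb{E}_\theta[\delta(X_{1:n})]$ inside the square (writing $\delta$ for $\delta(X_{1:n})$):
$$\mathbb{E}_\theta[(\theta-\delta)^2]=(\theta-\mathbb{E}_\theta[\delta])^2+\mathbb{E}_\theta[(\mathbb{E}_\theta[\delta]-\delta)^2]+2(\theta-\mathbb{E}_\theta[\delta])\,\mathbb{E}_\theta[\mathbb{E}_\theta[\delta]-\delta]$$
and the cross term vanishes because $\mathbb{E}_\theta[\mathbb{E}_\theta[\delta]-\delta]=0$ by definition of the expectation.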
- Can bias-variance decomposition be performed with loss functions other
than squared loss?
My interpretation of the squared bias + variance decomposition [and the way I teach it] is that it is the statistical equivalent of Pythagoras' theorem: the squared distance between an estimator and a point within a certain set equals the squared distance between the estimator and the set, plus the squared distance between the orthogonal projection onto the set and that point. Any loss based on a distance with a notion of orthogonal projection, i.e., an inner product, i.e., essentially a Hilbert space, satisfies this decomposition.
- For a given dataset, is there more than one model whose expected
loss is the minimum over all models, and if so, does that mean that
there could be different combinations of bias and variance that yield
the same minimum expected loss?
The question is unclear: if by minimum over models, you mean
$$\min_\theta \mathbb{E}_\theta[(\theta-\delta(X_{1:n}))^2]$$
then there are many examples of statistical models and associated decisions with a constant expected loss (or risk). Take, for instance, the MLE of a Normal mean, i.e., the sample average, whose risk $\sigma^2/n$ is the same for every value of $\theta$.
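A quick simulation illustrating that constant risk (the sample size and $\sigma$ here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 25, 1.0, 200_000

# Risk of the MLE (sample mean) of a Normal mean: E_theta[(theta - Xbar)^2].
# Theory says it equals sigma^2 / n for every theta, i.e. constant risk.
for theta in [-3.0, 0.0, 5.0]:
    xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
    risk = np.mean((theta - xbar) ** 2)
    print(f"theta={theta:+.1f}  simulated risk={risk:.4f}  theory={sigma**2/n:.4f}")
```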
- How can you calculate bias if you don't know the true model?
In a generic sense, the bias is the distance between the true model and the closest model within the assumed family of distributions. If the true model is unknown, the bias can be estimated by the bootstrap.
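A sketch of the bootstrap bias estimate, using as a hypothetical example the plug-in (MLE) variance estimator of a Normal sample, whose true bias $-\sigma^2/n$ is known in closed form for checking:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=50)          # one observed sample

def estimator(sample):
    # Plug-in (MLE) variance estimate, biased downward by factor (n-1)/n.
    return np.mean((sample - sample.mean()) ** 2)

theta_hat = estimator(x)

# Bootstrap bias estimate: mean of the estimator over resamples minus
# the estimate on the original data. No knowledge of the true model needed.
B = 5000
boot = np.array([estimator(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])
bias_boot = boot.mean() - theta_hat

print(f"estimate={theta_hat:.3f}  bootstrap bias={bias_boot:.3f}  "
      f"theory (given the data)={-theta_hat / x.size:.3f}")
```

For this particular statistic the bootstrap expectation is available in closed form, $\mathbb{E}^*[\hat\theta^*]=\frac{n-1}{n}\hat\theta$, so the simulated value can be checked against $-\hat\theta/n$.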
- Are there situations in which it makes more sense to minimize bias or variance rather than expected loss (the sum of squared bias and
variance)?
When considering another loss function like
$$(\theta-\mathbb{E}_\theta[\delta(X_{1:n})])^2+\alpha[(\mathbb{E}_\theta[\delta(X_{1:n})]-\delta(X_{1:n}))^2]\qquad 0<\alpha$$
pushing $\alpha$ to zero puts most of the evaluation on the bias, while pushing $\alpha$ to infinity switches the focus to the variance.
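A small numerical illustration (the shrinkage factor $c$ and the values of $\alpha$ are arbitrary choices): comparing the sample mean with a shrinkage estimator $c\bar X_n$ under this weighted loss, the unbiased estimator wins for small $\alpha$ and the low-variance one for large $\alpha$.

```python
import numpy as np

# Compare two estimators of a Normal mean theta under the weighted loss
#   bias^2 + alpha * variance.
# delta1 = sample mean Xbar:       bias 0,           variance sigma^2/n
# delta2 = shrinkage c * Xbar:     bias (c-1)*theta, variance c^2*sigma^2/n
theta, sigma, n, c = 2.0, 1.0, 10, 0.8

def weighted_loss(bias, var, alpha):
    return bias ** 2 + alpha * var

var1 = sigma ** 2 / n
bias2, var2 = (c - 1) * theta, c ** 2 * sigma ** 2 / n

for alpha in [0.01, 1.0, 100.0]:
    l1 = weighted_loss(0.0, var1, alpha)
    l2 = weighted_loss(bias2, var2, alpha)
    winner = "sample mean" if l1 < l2 else "shrinkage"
    print(f"alpha={alpha:>6}: mean loss={l1:.3f}, shrinkage loss={l2:.3f} -> {winner}")
```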
It helps to think carefully about exactly what type of objects $\hat \theta$ and $\hat g$ are.
In the top case, $\hat \theta$ would be what I would call an estimator of a parameter. Let's break it down. There is some true value we would like to gain knowledge about, $\theta$, and it is a number. To estimate the value of this parameter we use $\hat \theta$, which consumes a sample of data and produces a number that we take to be an estimate of $\theta$. Said differently, $\hat \theta$ is a function which consumes a set of training data and produces a number
$$ \hat \theta: \mathcal{T} \rightarrow \mathbb{R} $$
Often, when only one set of training data is around, people use the symbol $\hat \theta$ to mean the numeric estimate instead of the estimator, but in the grand scheme of things, this is a relatively benign abuse of notation.
OK, on to the second thing, what is $\hat g$? In this case, we are doing much the same, but this time we are estimating a function instead of a number. Now we consume a training dataset, and are returned a function from datapoints to real numbers
$$ \hat g: \mathcal{T} \rightarrow (\mathcal{X} \rightarrow \mathbb{R}) $$
This is a little mind bending the first time you think about it, but it's worth digesting.
Now, if we think of our samples as being distributed in some way, then $\hat \theta$ becomes a random variable, and we can take its expectation and variance and whatever we want, with no problem. But what is the variance of a function valued random variable? It's not really obvious.
The way out is to think like a computer programmer: what can functions do? They can be evaluated. This is where your $x_i$ comes in.
In this setup, $x_i$ is just a solitary fixed datapoint. The second equation is saying that as long as you hold a datapoint $x_i$ fixed, you can think of $\hat g$ as an estimator that returns a function, which you immediately evaluate to get a number. Now we are back in the situation where we consume datasets and get a number in return, so all our statistics of number-valued random variables come to bear.
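The type distinction can be made concrete in a short Python sketch; the sample mean stands in for $\hat\theta$, and a toy one-nearest-neighbour fit (an illustrative choice, not anything from the text) stands in for $\hat g$:

```python
import numpy as np

# theta_hat : training data -> number
def theta_hat(train_y):
    return float(np.mean(train_y))

# g_hat : training data -> (datapoint -> number)
# A toy 1-nearest-neighbour regression, returned as a closure.
def g_hat(train_x, train_y):
    def predict(x):
        i = int(np.argmin(np.abs(train_x - x)))
        return float(train_y[i])
    return predict

train_x = np.array([0.0, 1.0, 2.0])
train_y = np.array([1.0, 3.0, 5.0])

est = theta_hat(train_y)       # a number
f = g_hat(train_x, train_y)    # a function
value = f(0.9)                 # evaluating at a fixed x_i gives a number again
print(est, value)
```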
I've discussed this in a slightly different way in this answer.
Is it correct to think of this as each observation/fitted value having its own variance and bias?
Yup.
You can see this in confidence intervals around scatterplot smoothers: they tend to be wider near the boundaries of the data, since there the predicted value is more heavily influenced by the nearby training points. There are some examples in this tutorial on smoothing splines.
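A minimal simulation of this boundary effect, under an assumed setup of repeated cubic polynomial fits (a stand-in for a smoother) to noisy draws from a smooth curve:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 30, 2000
x = np.linspace(0.0, 1.0, n)
grid = np.array([0.0, 0.5, 1.0])      # boundary, centre, boundary

# Refit a cubic polynomial to fresh noisy samples of sin(2*pi*x) and
# record the prediction at each grid point.
preds = np.empty((reps, grid.size))
for r in range(reps):
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    coef = np.polyfit(x, y, deg=3)
    preds[r] = np.polyval(coef, grid)

# Pointwise standard deviation of the fitted values: larger at the edges.
sd = preds.std(axis=0)
print(f"sd at x=0: {sd[0]:.3f}, at x=0.5: {sd[1]:.3f}, at x=1: {sd[2]:.3f}")
```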
Bias and variance are elementary properties of estimators, and they're usually introduced to early statistics students because they're well understood conceptually and because one can study the properties of the quite restricted class of unbiased estimators: the Cramér-Rao bound, sufficiency, asymptotic relative efficiency, etc.
The fact that squared bias and variance arise as a decomposition of squared error is, if anything, an elegant result. The questions that immediately follow are too numerous to count: is squared error loss the right loss? Under what conditions is it optimal? Does a similar result exist for other loss functions? To paraphrase Paul Erdos, "Anyone can think of an interesting problem."
The first-year theory approach has its problems too. Consider Hodges' superefficient estimator: its asymptotic variance beats the Cramér-Rao bound at a single point. But it turns out that the estimator is not regular.
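A sketch of Hodges' construction for a Normal mean, thresholding the sample mean at $n^{-1/4}$, simulated to show the superefficiency at $\theta=0$ and the degraded risk just next to it (the particular $n$ and $\theta$ values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 20_000

def hodges(xbar, n):
    # Hodges' estimator: replace the sample mean by 0 whenever it falls
    # within n^(-1/4) of 0; otherwise keep the sample mean.
    return np.where(np.abs(xbar) < n ** (-0.25), 0.0, xbar)

results = {}
for theta in (0.0, 0.05):
    # Draw sample means directly: Xbar ~ N(theta, 1/n) when sigma = 1.
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    risk_mle = n * np.mean((xbar - theta) ** 2)                 # ~1 for the MLE
    risk_hodges = n * np.mean((hodges(xbar, n) - theta) ** 2)
    results[theta] = (risk_mle, risk_hodges)
    print(f"theta={theta}: n*risk(MLE)={risk_mle:.3f}, n*risk(Hodges)={risk_hodges:.3f}")
```

At $\theta=0$ the scaled risk of Hodges' estimator is far below the MLE's, but at $\theta=0.05$ it is substantially worse, the usual picture of superefficiency being paid for nearby.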
More broadly, once we start considering biased estimators, we have a much broader class of estimators with different optimality properties to consider. Concepts like admissibility, minimax, penalized or bounded loss, etc. give rise to other popular estimators as solutions to particular problems, notably Bayes estimators, ridge estimators, and so on. These concepts would be covered in a second-year statistics or probability theory class, from texts such as Ferguson's "A Course in Large Sample Theory", Lehmann and Casella's "Theory of Point Estimation", or Wasserman's "All of Nonparametric Statistics".