I find the most difficult part of this subject to be keeping track of what is being integrated over (the expectation is with respect to what?). Many expositions take this lightly, and the end result can be highly confusing. Here is The Elements of Statistical Learning on this point:
Discussions of error rate estimation can be confusing, because we have to make clear which quantities are fixed and which are random.
I think you're seeing a bit of that in the Wikipedia exposition.
My Preferred Derivation
Here's a detailed breakdown. Working through the exposition in this way is what finally cleared up the various concepts at play for me. I hope it helps you too.
The first step is to condition on $x$ and take the expectation with respect to $y$. No assumptions are needed for this part; $g$ is an arbitrary function:
$$\begin{align*}
ESE(g; x) &= E_Y \left[ \left( y - g(x) \right)^2 \mid x \right] \\
& = E_Y \left[ \left( y - E[y \mid x] + E[y \mid x]
- g(x) \right)^2 \mid x \right] \\
& = E_Y \left[ \left( y - E[y \mid x] \right)^2 \mid x \right]
+ E_Y \left[ \left(E[y \mid x] - g(x) \right)^2 \mid x \right] \\
& \quad + 2 E_Y \left[
\left( y - E[y \mid x] \right)
\left( E[y \mid x] - g(x) \right)
\mid x \right] \\
\end{align*}$$
The trick here is adding and subtracting $E[ y \mid x ]$ inside the square. This expectation is over $y$ (the only thing that makes sense in this context), so $E[ y \mid x ]$ is just a function of $x$, same as $g$.
Now we may observe that in
$$ 2 E_Y \left[
\left( y - E[y \mid x] \right)
\left( E[y \mid x] - g(x) \right)
\mid x \right] $$
the second factor has no dependence on $y$, and the first factor is zero in expectation, so this term dies, and we are left with
$$\begin{align*}
ESE(g; x) = E_Y \left[ \left( y - E[y \mid x] \right)^2 \mid x \right]
+ E_Y \left[ \left(E[y \mid x] - g(x) \right)^2 \mid x \right] \\
\end{align*}$$
The first term is the irreducible error; it's your $\sigma^2$ above.
It's worth noting that only the integrand of the first term has any dependence on $y$, so the $E_Y$ may be dropped from the second term.
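This conditional decomposition is easy to check by simulation. Below is a minimal Monte Carlo sketch; the particular $f$, $g$, noise level, and evaluation point are illustrative assumptions, not from the derivation itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fix a single input x; suppose y | x ~ Normal(f(x), sigma^2).
f = lambda x: np.sin(x)          # the true conditional mean E[y | x]
g = lambda x: 0.9 * np.sin(x)    # some arbitrary fixed predictor
sigma = 0.5
x = 1.3

# Monte Carlo estimate of E_Y[(y - g(x))^2 | x]
y = f(x) + sigma * rng.normal(size=1_000_000)
lhs = np.mean((y - g(x)) ** 2)

# Irreducible error plus squared distance of g from E[y | x]
rhs = sigma ** 2 + (f(x) - g(x)) ** 2

print(lhs, rhs)  # the two agree up to Monte Carlo error
```

Changing $g$ only moves the second term; the $\sigma^2$ floor is untouched, which is exactly what "irreducible" means here.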
To get to the bias-variance breakdown, we introduce an expectation over the sampling distribution of the training data. That is, we consider $g$ itself as a random variable: training a learning algorithm takes us from a dataset $D$ to a function $g$, so it makes sense to take expectations over this sampling distribution with $g$ as a random variate. I'll write $g(x, D)$ to emphasize the dependence of $g$ on the training dataset $D$. With this notation, the second term in our final equation above becomes
$$ \left( E[y \mid x] - g(x,D) \right)^2 $$
Taking the expectation of everything over $D$, and writing $Eg(x)$ for the expectation of $g(x, D)$ with respect to this sampling distribution, we can break this down further
$$\begin{align*}
& E_{D} \left[ \left(E[y \mid x] - g(x,D) \right)^2 \mid x \right] \\
& = E_{D} \left[ \left(E[y \mid x] - Eg(x) + Eg(x) - g(x,D) \right)^2 \mid x \right] \\
& = E_{D} \left[ \left(E[y \mid x] - Eg(x) \right)^2
\mid x \right]
+ E_{D} \left[ \left(Eg(x) - g(x, D) \right)^2 \mid x \right] \\
& \quad + 2 E_{D} \left[
\left(E[y \mid x] - Eg(x) \right)
\left( Eg(x) - g(x,D) \right)
\mid x \right] \\
\end{align*}
$$
The same trick applies. In this breakdown the first factor has no dependence on $D$, and the second is zero in expectation, so the cross term dies. Therefore
$$\begin{align*}
E_{D} \left[ \left(E[y \mid x] - g(x,D) \right)^2 \mid x \right] = E_{D} \left[ \left(E[y \mid x] - Eg(x) \right)^2
\mid x \right]
+ E_{D} \left[ \left(Eg(x) - g(x, D) \right)^2 \mid x \right]
\end{align*}$$
The first term here is the squared bias, and the second is the variance.
Of course, I kept the conditioning on $x$ the entire time, so what we have really arrived at is the pointwise decomposition. To get the usual full decomposition of the total error, all that is needed is to integrate out $x$ as well.
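The pointwise split can also be verified numerically by retraining on many fresh datasets. In this sketch, "training" is an assumed least-squares line fit; the true function, noise level, and fit are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(x)               # the true conditional mean E[y | x]
sigma, n_train, n_reps = 0.3, 20, 5000
x0 = 1.0                              # the fixed point of the decomposition

preds = np.empty(n_reps)
for i in range(n_reps):
    # Draw a fresh training set D and fit g(., D)
    xs = rng.uniform(0, np.pi, n_train)
    ys = f(xs) + sigma * rng.normal(size=n_train)
    coefs = np.polyfit(xs, ys, deg=1)
    preds[i] = np.polyval(coefs, x0)  # g(x0, D)

mse = np.mean((f(x0) - preds) ** 2)   # E_D[(E[y|x0] - g(x0,D))^2]
bias_sq = (f(x0) - preds.mean()) ** 2 # (E[y|x0] - Eg(x0))^2
variance = preds.var()                # E_D[(Eg(x0) - g(x0,D))^2]
print(mse, bias_sq + variance)        # equal (the identity is exact)
```

Note that with the sample mean and the ddof=0 sample variance, the split is an exact algebraic identity, not just a Monte Carlo approximation.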
How This Relates to the Wikipedia Breakdown
Wikipedia writes $f(x)$ for my $E[ y \mid x ]$. I think making this explicit makes the exposition easier to follow. Wikipedia writes
$$ y = f(x) + \epsilon $$
with the assumption that $E(\epsilon) = 0$. What is really meant is $E(\epsilon \mid x) = 0$. This is implicit in my exposition; I simply use $E[y \mid x]$ throughout.
Note that no independence assumptions are necessary for the bias-variance decomposition. Where Wikipedia says
Thus, since $\epsilon$ and $\hat{f}$ are independent
what they really mean is that $\hat{f}$ can be treated as a constant when conditioning on $x$, so it can be pulled out in front of the expectation.
It helps to think carefully about exactly what type of objects $\hat \theta$ and $\hat g$ are.
In the first case, $\hat \theta$ is what I would call an estimator of a parameter. Let's break it down. There is some true value $\theta$ we would like to learn about; it is a number. To estimate this parameter we use $\hat \theta$, which consumes a sample of data and produces a number that we take to be an estimate of $\theta$. Said differently, $\hat \theta$ is a function which consumes a set of training data and produces a number:
$$ \hat \theta: \mathcal{T} \rightarrow \mathbb{R} $$
Often, when only one set of training data is around, people use the symbol $\hat \theta$ to mean the numeric estimate rather than the estimator, but in the grand scheme of things this is a relatively benign abuse of notation.
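A small sketch of this "dataset in, number out" picture, using the (biased) plug-in variance estimator as an assumed example:

```python
import numpy as np

rng = np.random.default_rng(2)

# An estimator consumes a dataset and returns a number.  Here the
# parameter is the variance of a Normal(0, 1), so theta = 1, and the
# estimator is the plug-in variance (ddof=0), which is biased.
def theta_hat(sample):
    return np.var(sample)  # a function: dataset -> number

n, n_reps = 10, 50_000
estimates = np.array([theta_hat(rng.normal(size=n)) for _ in range(n_reps)])

# Each draw of training data gives a different number, so theta_hat
# is a random variable with a bias and a variance of its own.
print(estimates.mean())  # approx (n-1)/n = 0.9, not 1: biased downward
```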
OK, on to the second thing: what is $\hat g$? In this case we are doing much the same, but this time we are estimating a function instead of a number. Now we consume a training dataset and are returned a function from datapoints to real numbers:
$$ \hat g: \mathcal{T} \rightarrow (\mathcal{X} \rightarrow \mathbb{R}) $$
This is a little mind bending the first time you think about it, but it's worth digesting.
Now, if we think of our samples as being distributed in some way, then $\hat \theta$ becomes a random variable, and we can take its expectation, variance, and whatever else we want with no problem. But what is the variance of a function-valued random variable? It's not really obvious.
The way out is to think like a computer programmer: what can functions do? They can be evaluated. This is where your $x_i$ comes in.
In this setup, $x_i$ is just a solitary fixed datapoint. The second equation is saying that as long as you hold a datapoint $x_i$ fixed, you can think of $\hat g$ as an estimator that returns a function, which you immediately evaluate to get a number. Now we're back in the situation where we consume datasets and get a number in return, so all our statistics of real-valued random variables comes to bear.
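In code, $\hat g$ can be modeled as a function that returns a closure, mirroring the type $\hat g: \mathcal{T} \rightarrow (\mathcal{X} \rightarrow \mathbb{R})$; the degree-2 polynomial fit below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

# g_hat consumes a training set and returns a *function* (a closure).
def g_hat(xs, ys):
    coefs = np.polyfit(xs, ys, deg=2)
    return lambda x: np.polyval(coefs, x)  # X -> R

# One training set gives one fitted function ...
xs = rng.uniform(-1, 1, 30)
ys = xs ** 2 + 0.1 * rng.normal(size=30)
fitted = g_hat(xs, ys)

# ... and only after evaluating it at a fixed point x_i do we get a
# plain number, whose distribution over training sets we can study.
x_i = 0.5
print(fitted(x_i))  # a single real number, close to 0.25 here
```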
I've discussed this in a slightly different way in this answer.
Is it correct to think of this as each observation/fitted value having its own variance and bias?
Yup.
You can see this in confidence intervals around scatterplot smoothers: they tend to be wider near the boundaries of the data, since there the predicted value is more heavily influenced by the nearby training points. There are some examples in this tutorial on smoothing splines.
Best Answer
The two formulas express the bias-variance trade-off in two different (though related) contexts. It is confusing to use the letter $Y$ in both cases, since it stands for different things in each; for the first case, I'm going to use the letter $F$ instead.
For an estimator
In this case, you have a model with a parameter $\theta$ giving the distribution of your data. You want to make an inference about this parameter, or more precisely about a property of it, $F$, which is a function of $\theta$. For this you have an estimator $\hat F$ that is a function of your data. Then the formula is:
$$E_\theta\left((F-\hat F)^2\right)=V_\theta(\hat F)+\left(E_\theta(\hat F-F)\right)^2$$
This formula gives you the MSE of your estimator; or, if you prefer, it measures the quality of the estimator in terms of squared distance to what it is supposed to estimate.
For a predictor
In this case you have a model with additive noise: $X$ is the input, $Y$ is the output, and the model is $Y=f_\theta(X)+\epsilon$, where the noise is assumed to have mean 0 and known variance $\sigma^2$.
You had some data to learn from, and now you take a new $x_0$ and want to predict the unknown outcome $y_0$. Since the noise has mean 0, it is natural to use $f_\theta(x_0)$ as a guess for $y_0$. You don't know $\theta$, and thus you don't know the value of the function (of $\theta$) $f_\theta(x_0)$. So you need to use an estimator of it, called $\hat f(x_0)$. Using the previous formula for this estimator (with $F=f(x_0)$), you know that:
$$E_\theta\left((f(x_0)-\hat f(x_0))^2\right)=V_\theta(\hat f(x_0))+\left(E_\theta(\hat f(x_0)-f(x_0))\right)^2$$
On the other hand, because of the additive noise:
$$E_\theta((f(x_0)-y_0)^2)=\sigma^2$$
Assuming the noise is independent of your training data, you can finally combine the two formulas (I'm skipping the technical details):
$$E_\theta\left((y_0-\hat f(x_0))^2\right)=V_\theta(\hat f(x_0))+\left(E_\theta(\hat f(x_0)-f(x_0))\right)^2+\sigma^2$$
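This combined formula can be checked numerically at a fixed $x_0$. The sketch below assumes a linear model with Gaussian noise and a least-squares fit; these are illustrative choices, not the answer's specific setup:

```python
import numpy as np

rng = np.random.default_rng(4)

f = lambda x: 2.0 * x + 1.0            # the true regression function
sigma, n_train, n_reps = 0.5, 15, 10_000
x0 = 0.7

preds = np.empty(n_reps)
errors = np.empty(n_reps)
for i in range(n_reps):
    # Fresh training data, fresh fit, fresh test outcome
    xs = rng.uniform(0, 1, n_train)
    ys = f(xs) + sigma * rng.normal(size=n_train)
    coefs = np.polyfit(xs, ys, deg=1)
    preds[i] = np.polyval(coefs, x0)    # f_hat(x0)
    y0 = f(x0) + sigma * rng.normal()   # a new noisy outcome at x0
    errors[i] = (y0 - preds[i]) ** 2

lhs = errors.mean()                     # E[(y0 - f_hat(x0))^2]
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma ** 2
print(lhs, rhs)  # variance + bias^2 + sigma^2, up to Monte Carlo error
```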
In a less formal way: