Simply note that
$$|| \widehat{\mu} - \mu ||^2 = \sum\limits_{i = 1}^{n} (\widehat{\mu}_{i} - \mu_{i})^2$$
Then, the answer is given by the decomposition you gave earlier:
$$ \mathbb{E}[(\widehat{\mu}_{i} - \mu_{i})^2] = \operatorname{Var}[\widehat{\mu}_{i}] + [\operatorname{Bias}(\widehat{\mu}_{i}, \mu_{i})]^2 $$
Summing over $i$, we get
$$ \mathbb{E}[||\widehat{\mu} - \mu||^2] = \sum\limits_{i = 1}^{n} \left( \operatorname{Var}[\widehat{\mu}_{i}] + [\operatorname{Bias}(\widehat{\mu}_{i}, \mu_{i})]^2 \right) $$
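A quick Monte Carlo check of this identity (a sketch in Python; the shrinkage estimator $\widehat{\mu} = 0.8\,x$ is just an arbitrary way to produce nonzero bias):

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps, c = 5, 200_000, 0.8          # dimension, Monte Carlo replications, shrinkage factor
mu = np.linspace(-2.0, 2.0, n)        # true mean vector

x = rng.normal(mu, 1.0, size=(reps, n))   # one noisy observation of mu per replication
mu_hat = c * x                             # a deliberately biased shrinkage estimator

# Left-hand side: E||mu_hat - mu||^2, estimated by Monte Carlo
lhs = np.mean(np.sum((mu_hat - mu) ** 2, axis=1))

# Right-hand side: sum over coordinates of Var[mu_hat_i] + Bias(mu_hat_i, mu_i)^2
var = mu_hat.var(axis=0)
bias = mu_hat.mean(axis=0) - mu
rhs = np.sum(var + bias ** 2)

print(lhs, rhs)   # the two numbers agree up to Monte Carlo error
```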
A different issue altogether is the covariance matrix $\mathbb{E}[(\widehat{\mu} - \mu)(\widehat{\mu} - \mu)^t]$.
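(For what it's worth, the two pictures are linked by a standard identity: writing $b = \mathbb{E}[\widehat{\mu}] - \mu$ for the bias vector,
$$\mathbb{E}[(\widehat{\mu} - \mu)(\widehat{\mu} - \mu)^t] = \operatorname{Cov}(\widehat{\mu}) + b\,b^t,$$
and taking the trace of both sides recovers the scalar decomposition above, since $\mathbb{E}[||\widehat{\mu} - \mu||^2] = \operatorname{tr}\,\mathbb{E}[(\widehat{\mu} - \mu)(\widehat{\mu} - \mu)^t]$.)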
The similarity is more than superficial.
The "bias-variance tradeoff" can be interpreted as the Pythagorean Theorem applied to two perpendicular Euclidean vectors: the length of one is the standard deviation and the length of the other is the bias. The length of the hypotenuse is the root mean squared error.
A fundamental relationship
As a point of departure, consider this revealing calculation, valid for any random variable $X$ with a finite second moment and any real number $a$. Since the second moment is finite, $X$ has a finite mean $\mu=\mathbb{E}(X)$ for which $\mathbb{E}(X-\mu)=0$, whence
$$\begin{aligned}
\mathbb{E}((X-a)^2) &= \mathbb{E}((X-\mu\,+\,\mu-a)^2) \\
&= \mathbb{E}((X-\mu)^2) + 2(\mu-a)\,\mathbb{E}(X-\mu) + (\mu-a)^2 \\
&= \operatorname{Var}(X) + (\mu-a)^2.\tag{1}
\end{aligned}$$
This shows how the mean squared deviation between $X$ and any "baseline" value $a$ varies with $a$: it is a quadratic function of $a$ with a minimum at $\mu$, where the mean squared deviation is the variance of $X$.
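A small simulation illustrating $(1)$ (a sketch in Python; the exponential distribution and the baseline values $a$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)   # any X with a finite second moment
mu = x.mean()

for a in (0.0, 1.0, mu, 3.0):
    lhs = np.mean((x - a) ** 2)          # E((X - a)^2), by Monte Carlo
    rhs = x.var() + (mu - a) ** 2        # Var(X) + (mu - a)^2, relation (1)
    print(f"a={a:.3f}  E((X-a)^2)={lhs:.4f}  Var+(mu-a)^2={rhs:.4f}")
```

The quadratic dependence on $a$ is visible in the output: the mean squared deviation is smallest at $a=\mu$, where it equals the variance.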
The connection with estimators and bias
Any estimator $\hat \theta$ is a random variable because (by definition) it is a (measurable) function of random variables. Letting it play the role of $X$ in the preceding, and letting the estimand (the thing $\hat\theta$ is supposed to estimate) be $\theta$, we have
$$\operatorname{MSE}(\hat\theta) = \mathbb{E}((\hat\theta-\theta)^2) = \operatorname{Var}(\hat\theta) + (\mathbb{E}(\hat\theta)-\theta)^2.$$
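As a concrete check (a sketch; the $1/n$ variance estimator for normal data is simply a familiar example of a biased estimator):

```python
import numpy as np

rng = np.random.default_rng(2)

sigma2 = 4.0            # the estimand: the true variance
n, reps = 10, 200_000   # sample size, Monte Carlo replications

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
theta_hat = samples.var(axis=1, ddof=0)    # the 1/n (MLE) variance estimator, known to be biased

mse = np.mean((theta_hat - sigma2) ** 2)
var = theta_hat.var()
bias = theta_hat.mean() - sigma2

print(mse, var + bias ** 2)   # MSE = Var + Bias^2, up to Monte Carlo error
# Theory for normal data: Bias = -sigma2/n, Var = 2*(n-1)*sigma2**2 / n**2
```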
Let's return to $(1)$ now that we have seen how the statement about bias+variance for an estimator is literally a case of $(1)$. The question seeks "mathematical analogies with mathematical objects." We can do more than that by showing that square-integrable random variables can naturally be made into a Euclidean space.
Mathematical background
In a very general sense, a random variable is a (measurable) real-valued function on a probability space $(\Omega, \mathfrak{S}, \mathbb{P})$. The set of such functions that are square integrable, which is often written $\mathcal{L}^2(\Omega)$ (with the given probability structure understood), almost is a Hilbert space. To make it into one, we have to conflate any two random variables $X$ and $Y$ which don't really differ in terms of integration: that is, we say $X$ and $Y$ are equivalent whenever
$$\mathbb{E}(|X-Y|^2) = \int_\Omega |X(\omega)-Y(\omega)|^2 d\mathbb{P}(\omega) = 0.$$
It's straightforward to check that this is a true equivalence relation: most importantly, when $X$ is equivalent to $Y$ and $Y$ is equivalent to $Z$, then necessarily $X$ will be equivalent to $Z$. We may therefore partition all square-integrable random variables into equivalence classes. These classes form the set $L^2(\Omega)$. Moreover, $L^2$ inherits the vector space structure of $\mathcal{L}^2$ defined by pointwise addition of values and pointwise scalar multiplication. On this vector space, the function
$$X \to \left(\int_\Omega |X(\omega)|^2 d\mathbb{P}(\omega)\right)^{1/2}=\sqrt{\mathbb{E}(|X|^2)}$$
is a norm, often written $||X||_2$. This norm makes $L^2(\Omega)$ into a Hilbert space. Think of a Hilbert space $\mathcal{H}$ as an "infinite dimensional Euclidean space." Any finite-dimensional subspace $V\subset \mathcal{H}$ inherits the norm from $\mathcal{H}$, and $V$ with this norm is a Euclidean space: we can do Euclidean geometry in it.
Finally, we need one fact that is special to probability spaces (rather than general measure spaces): because $\mathbb{P}$ is a probability, it is bounded (by $1$), whence the constant functions $\omega\to a$ (for any fixed real number $a$) are square integrable random variables with finite norms.
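On a finite probability space these notions reduce to ordinary linear algebra, which can make them easier to see. A minimal sketch (the three-point space and the particular random variable are arbitrary illustrations):

```python
import numpy as np

# A finite probability space Omega = {0, 1, 2} with probability measure P
p = np.array([0.2, 0.5, 0.3])

# Random variables are just real-valued functions on Omega, i.e. vectors of values
X = np.array([1.0, -2.0, 4.0])
one = np.array([1.0, 1.0, 1.0])        # the constant random variable "1"

def inner(U, V):
    """The L^2 inner product <U, V> = E(UV)."""
    return np.sum(p * U * V)

def norm2(U):
    """The squared L^2 norm ||U||_2^2 = E(U^2)."""
    return inner(U, U)

mu = inner(X, one)                      # E(X) is the inner product of X with 1
print(norm2(one))                       # constants have finite norm because P is bounded: ||1||_2^2 = 1
print(norm2(X))                         # ||X||_2^2 = E(X^2)
```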
A geometric interpretation
Consider any square-integrable random variable $X$, thought of as a representative of its equivalence class in $L^2(\Omega)$. It has a mean $\mu=\mathbb{E}(X)$ which (as one can check) depends only on the equivalence class of $X$. Let $\mathbf{1}:\omega\to 1$ be the class of the constant random variable.
$X$ and $\mathbf{1}$ generate a Euclidean subspace $V\subset L^2(\Omega)$ whose dimension is at most $2$. In this subspace, $||X||_2^2 = \mathbb{E}(X^2)$ is the squared length of $X$ and $||a\,\mathbf{1}||_2^2 = a^2$ is the squared length of the constant random variable $\omega\to a$. It is fundamental that $X-\mu\mathbf{1}$ is perpendicular to $\mathbf{1}$. (One definition of $\mu$ is that it's the unique number for which this is the case.) Relation $(1)$ may be written
$$||X - a\mathbf{1}||_2^2 = ||X - \mu\mathbf{1}||_2^2 + ||(a-\mu)\mathbf{1}||_2^2.$$
It indeed is precisely the Pythagorean Theorem, in essentially the same form known 2500 years ago. The object $$X-a\mathbf{1} = (X-\mu\mathbf{1})-(a-\mu)\mathbf{1}$$ is the hypotenuse of a right triangle with legs $X-\mu\mathbf{1}$ and $(a-\mu)\mathbf{1}$.
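The right triangle can be checked numerically on a small finite probability space (a sketch; the numbers are arbitrary):

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])           # probabilities on a three-point space
X = np.array([1.0, -2.0, 4.0])          # a square-integrable random variable
one = np.ones(3)                        # the constant random variable 1

norm2 = lambda U: np.sum(p * U * U)     # ||U||_2^2 = E(U^2)
mu = np.sum(p * X)                      # E(X)
a = 2.5                                 # any baseline value

# X - mu*1 is perpendicular to 1 ...
print(np.sum(p * (X - mu * one) * one))                  # ~0
# ... so the Pythagorean theorem gives relation (1):
print(norm2(X - a * one))                                # squared hypotenuse
print(norm2(X - mu * one) + norm2((a - mu) * one))       # sum of the squared legs
```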
If you would like mathematical analogies, then, you may use anything that can be expressed in terms of the hypotenuse of a right triangle in a Euclidean space. The hypotenuse will represent the "error" and the legs will represent the bias and the deviations from the mean.
It helps to think carefully about exactly what type of objects $\hat \theta$ and $\hat g$ are.
In the top case, $\hat \theta$ would be what I would call an estimator of a parameter. Let's break it down. There is some true value $\theta$ that we would like to gain knowledge about; it is a number. To estimate the value of this parameter we use $\hat \theta$, which consumes a sample of data and produces a number, which we take to be an estimate of $\theta$. Said differently, $\hat \theta$ is a function which consumes a set of training data and produces a number:
$$ \hat \theta: \mathcal{T} \rightarrow \mathbb{R} $$
Often, when only one set of training data is around, people use the symbol $\hat \theta$ to mean the numeric estimate instead of the estimator, but in the grand scheme of things, this is a relatively benign abuse of notation.
OK, on to the second thing: what is $\hat g$? In this case, we are doing much the same, but this time we are estimating a function instead of a number. Now we consume a training dataset and are returned a function from datapoints to real numbers:
$$ \hat g: \mathcal{T} \rightarrow (\mathcal{X} \rightarrow \mathbb{R}) $$
This is a little mind bending the first time you think about it, but it's worth digesting.
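In programming terms, the difference in the two type signatures looks like this (a sketch in Python; the sample mean and the least-squares line are arbitrary choices of estimator and function estimator):

```python
import numpy as np

def theta_hat(training_data):
    """An estimator of a parameter: consumes a dataset, returns a number."""
    x, y = training_data
    return float(np.mean(y))                   # e.g. estimate E(Y) by the sample mean

def g_hat(training_data):
    """An estimator of a function: consumes a dataset, returns a function X -> R."""
    x, y = training_data
    slope, intercept = np.polyfit(x, y, deg=1) # e.g. fit a least-squares line
    def fitted(x_new):
        return slope * x_new + intercept
    return fitted

# Usage: theta_hat(data) is a number, g_hat(data) is a function you can evaluate later
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 50)
y = 2 * x + rng.normal(0, 0.1, 50)
print(theta_hat((x, y)))        # a number
print(g_hat((x, y))(0.5))       # a function, evaluated at the point 0.5
```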
Now, if we think of our samples as being distributed in some way, then $\hat \theta$ becomes a random variable, and we can take its expectation and variance and whatever else we want, with no problem. But what is the variance of a function-valued random variable? It's not really obvious.
The way out is to think like a computer programmer: what can functions do? They can be evaluated. This is where your $x_i$ comes in.
In this setup, $x_i$ is just a solitary fixed datapoint. The second equation is saying that, as long as the datapoint $x_i$ is held fixed, you can think of $\hat g$ as an estimator that returns a function, which you immediately evaluate at $x_i$ to get a number. Now we're back in the situation where we consume datasets and get a number in return, so all our statistics of number-valued random variables comes to bear.
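Here is that idea as a simulation sketch: draw many training sets, fit $\hat g$ on each, and evaluate every fitted function at the same fixed $x_i$; the evaluations form an ordinary number-valued random sample whose variance and bias we can compute (the cubic fit and the sine target are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def true_g(x):
    return np.sin(2 * np.pi * x)              # the target function (arbitrary choice)

def fit_g_hat(x, y):
    """Fit a function estimator; here, a least-squares cubic (an arbitrary choice)."""
    coefs = np.polyfit(x, y, deg=3)
    return lambda x_new: np.polyval(coefs, x_new)

x_i = 0.3                                      # a fixed evaluation point
n, reps = 30, 5_000
values = np.empty(reps)

for r in range(reps):
    x = rng.uniform(0, 1, n)                   # a fresh training set each time
    y = true_g(x) + rng.normal(0, 0.3, n)
    values[r] = fit_g_hat(x, y)(x_i)           # evaluate the fitted function at x_i: a number

# Ordinary univariate statistics of the number-valued random variable g_hat(x_i)
print("variance:", values.var())
print("bias:", values.mean() - true_g(x_i))
```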
I've discussed this in a slightly different way in this answer.
Yup.
You can see this in confidence intervals around scatterplot smoothers: they tend to be wider near the boundaries of the data, where the predicted value is more strongly influenced by the nearby training points. There are some examples in this tutorial on smoothing splines.
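A rough way to see the boundary effect by simulation (a sketch; a least-squares cubic stands in for a scatterplot smoother, and larger pointwise standard deviations near the edges of $[0,1]$ correspond to wider confidence bands there):

```python
import numpy as np

rng = np.random.default_rng(5)

def fit(x, y):
    return np.polyfit(x, y, deg=3)                 # a cubic fit stands in for a smoother

grid = np.array([0.02, 0.25, 0.50, 0.75, 0.98])    # boundary and interior evaluation points
n, reps = 40, 2_000
preds = np.empty((reps, grid.size))

for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    preds[r] = np.polyval(fit(x, y), grid)

# Pointwise standard deviations are larger near the boundaries (0.02 and 0.98)
for g, s in zip(grid, preds.std(axis=0)):
    print(f"x = {g:.2f}  sd of prediction = {s:.3f}")
```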