When evaluating an estimator, the two most commonly used criteria are probably the maximum risk and the Bayes risk. My question refers to the latter:
The Bayes risk under the prior $\pi$ is defined as follows:
$$B_{\pi} (\hat{\theta}) = \int R(\theta, \hat{\theta} ) \pi ( \theta ) d \theta $$
I don't quite get what the prior $\pi$ is doing and how I should interpret it. If I have a risk function $R(\theta, \hat{\theta} )$ and plot it, intuitively I would take its area as a criterion for judging how "strong" the risk is over all possible values of $\theta$. But involving the prior somehow destroys this intuition, even though the two ideas are close. Can someone help me interpret the prior?
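To make the definition concrete, here is a small numerical sketch (the binomial model, the uniform prior, and all function names are my own illustrative choices, not part of the question): it approximates $B_{\pi}(\hat{\theta}) = \int R(\theta,\hat{\theta})\pi(\theta)\,d\theta$ by a midpoint-rule quadrature, which shows how the prior acts as a weight on the risk curve.

```python
def bayes_risk(risk, prior_pdf, n_grid=10_000):
    """Approximate B_pi = integral of R(theta) * pi(theta) d theta
    over [0, 1] with the midpoint rule."""
    h = 1.0 / n_grid
    return sum(risk(h * (i + 0.5)) * prior_pdf(h * (i + 0.5)) * h
               for i in range(n_grid))

# Illustrative model (my choice): X ~ Binomial(n, theta), estimator
# delta(x) = x / n, squared-error loss, for which the risk function is
# R(theta) = theta * (1 - theta) / n.  Uniform prior pi(theta) = 1 on [0, 1].
n = 10
B = bayes_risk(lambda t: t * (1 - t) / n, lambda t: 1.0)
# With the uniform prior, B is exactly the area under the risk curve;
# a non-uniform prior would instead emphasize the regions of Theta that
# pi considers more plausible.
```

So your "area under the risk curve" intuition is precisely the special case of a flat prior; a general $\pi$ turns the plain area into a weighted area.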
Best Answer
[Here is an excerpt from my own textbook, The Bayesian Choice (2007), that argues in favour of a decision-theoretic approach to Bayesian analysis, hence of using the Bayes risk.]
Except for the most trivial settings, it is generally impossible to uniformly minimize (in $d$) the loss function $\text{L}(\theta,d)$ when $\theta$ is unknown. In order to derive an effective comparison criterion from the loss function, the frequentist approach proposes to consider instead the average loss (or frequentist risk) \begin{eqnarray*} R(\theta,\delta) & = & \mathbb{E}_\theta \lbrack \text{L} (\theta ,\delta(x))\rbrack \\ & = & \int_{\cal X} \text{L}(\theta,\delta(x))f(x|\theta) \,dx , \end{eqnarray*} where $\delta(x)$ is the decision rule, i.e., the allocation of a decision to each outcome $x\sim f(x|\theta)$ from the random experiment.
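The integral over ${\cal X}$ above can be spelled out in a short sketch (the binomial model, squared-error loss, and the function name are my own assumptions for illustration, not from the excerpt): the frequentist risk averages the loss over the sampling distribution $f(x|\theta)$, for a fixed $\theta$.

```python
from math import comb

def frequentist_risk(theta, n):
    """R(theta, delta) = E_theta[ L(theta, delta(X)) ], computed by summing
    the loss over the sampling distribution of X.
    Illustrative model (my choice): X ~ Binomial(n, theta),
    delta(x) = x / n, squared-error loss L(theta, d) = (d - theta)**2."""
    return sum(comb(n, x) * theta ** x * (1 - theta) ** (n - x)
               * (x / n - theta) ** 2
               for x in range(n + 1))
```

For this model the sum reduces to the closed form $\theta(1-\theta)/n$: the risk is still a function of the unknown $\theta$, which is exactly why a further criterion (minimaxity, or a prior to average over) is needed to compare estimators.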
The function $\delta$, from ${\mathcal X}$ to $\mathfrak{D}$, is usually called an estimator (while the value $\delta(x)$ is called an estimate of $\theta$). When there is no risk of confusion, we also denote the set of estimators by $\mathfrak{D}$.
The frequentist paradigm relies on this criterion to compare estimators and, if possible, to select the best estimator, the reasoning being that estimators are evaluated on their long-run performance for all possible values of the parameter $\theta$. Notice, however, that there are several difficulties associated with this approach.
Example 2.4 - Consider $x_1$ and $x_2$, two observations from $$ P_{\theta}(x = \theta-1) = P_{\theta}(x = \theta+1) = 0.5, \qquad \theta\in\mathbb{R}. $$ The parameter of interest is $\theta$ (i.e., $\mathfrak{D} = \Theta$) and it is estimated by estimators $\delta$ under the loss $$ \text{L}(\theta,\delta) = 1-\mathbb{I}_{\theta}(\delta), $$ often called $0-1$ loss, which penalizes errors of estimation, whatever their magnitude, by $1$. Considering the particular estimator $$ \delta_0(x_1,x_2) = {x_1+x_2 \over 2}, $$ its risk function is \begin{eqnarray*} R(\theta,\delta_0) & = & 1-P_{\theta}(\delta_0(x_1,x_2) = \theta) \\ & = & 1-P_{\theta}(x_1 \ne x_2) = 0.5. \end{eqnarray*} This computation shows that the estimator $\delta_0$ is correct half of the time. Actually, this estimator is always correct when $x_1\ne x_2$, and always wrong otherwise. Now, the estimator $\delta_1(x_1,x_2) = x_1+1$ also has a risk function equal to $0.5$, as does $\delta_2(x_1,x_2) = x_2-1$. Therefore, $\delta_0$, $\delta_1$ and $\delta_2$ cannot be ranked under the $0-1$ loss. $\blacktriangleright$
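A quick Monte Carlo check of Example 2.4 (my own sketch, not part of the excerpt; function names and the number of replications are arbitrary) confirms that all three estimators share the same $0-1$ risk of $0.5$, so the frequentist criterion indeed cannot rank them:

```python
import random

def risk_01(delta, theta, n_sim=100_000, seed=42):
    """Monte Carlo estimate of R(theta, delta) = 1 - P_theta(delta = theta)
    under 0-1 loss, for the two-point model of Example 2.4."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_sim):
        x1 = theta + rng.choice([-1, 1])  # each value has probability 0.5
        x2 = theta + rng.choice([-1, 1])
        if delta(x1, x2) != theta:
            errors += 1
    return errors / n_sim

delta0 = lambda x1, x2: (x1 + x2) / 2  # right iff x1 != x2
delta1 = lambda x1, x2: x1 + 1         # right iff x1 = theta - 1
delta2 = lambda x1, x2: x2 - 1         # right iff x2 = theta + 1
```

Each estimate hovers around $0.5$ for any $\theta$, even though the three estimators are right and wrong on very different subsets of the sample space.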
On the contrary, the Bayesian approach to Decision Theory integrates on the space $\Theta$ since $\theta$ is unknown, instead of integrating on the space ${\cal X}$ as $x$ is known. It relies on the posterior expected loss \begin{eqnarray*} \rho(\pi,d|x) & = & \mathbb{E}^\pi[L(\theta,d)|x] \\ & = & \int_{\Theta} \text{L}(\theta,d) \pi(\theta|x)\, d\theta, \end{eqnarray*} which averages the error (i.e., the loss) according to the posterior distribution of the parameter $\theta$, conditionally on the observed value $x$. Given $x$, the average error resulting from decision $d$ is actually $\rho(\pi,d|x)$. The posterior expected loss is thus a function of $x$, but this dependence is not troublesome, as opposed to the frequentist dependence of the risk on the parameter: $x$, contrary to $\theta$, is known.
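The posterior expected loss can be computed exactly for the model of Example 2.4 under a discrete prior (a sketch of my own; the uniform prior on $\{0,\dots,5\}$ and the function names are assumptions made so the sum stays finite):

```python
from fractions import Fraction

def posterior_expected_loss(d, x1, x2, prior):
    """rho(pi, d | x) = sum over theta of L(theta, d) * pi(theta | x),
    for the two-point model of Example 2.4 under 0-1 loss.
    `prior` maps theta -> prior weight (a discrete prior, assumed here)."""
    def likelihood(theta):
        p1 = Fraction(1, 2) if abs(x1 - theta) == 1 else Fraction(0)
        p2 = Fraction(1, 2) if abs(x2 - theta) == 1 else Fraction(0)
        return p1 * p2
    weights = {th: w * likelihood(th) for th, w in prior.items()}
    total = sum(weights.values())
    posterior = {th: w / total for th, w in weights.items()}
    # Under 0-1 loss, the expected loss of deciding d is
    # 1 - (posterior probability that theta equals d).
    return 1 - posterior.get(d, Fraction(0))

prior = {th: Fraction(1, 6) for th in range(6)}  # uniform on {0, ..., 5}
```

With $x_1=2$, $x_2=4$ the posterior is a point mass at $\theta=3$, so deciding $d=3$ has zero posterior expected loss; with $x_1=x_2=2$ the posterior splits evenly between $\theta=1$ and $\theta=3$, so no decision can do better than $1/2$. This recovers, conditionally on the data, exactly the "always right when $x_1\ne x_2$, always wrong otherwise" behaviour that the frequentist risk of $0.5$ averaged away.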