Solved – Understanding the Bayes risk

bayesian, decision-theory

When evaluating an estimator, the two most commonly used criteria are probably the maximum risk and the Bayes risk. My question refers to the latter one:

The Bayes risk under the prior $\pi$ is defined as follows:

$$B_{\pi} (\hat{\theta}) = \int R(\theta, \hat{\theta} ) \pi ( \theta ) d \theta $$

I don't quite get what the prior $\pi$ is doing and how I should interpret it. If I have a risk function $R(\theta, \hat{\theta} )$ and plot it, intuitively I would take its area as a criterion to judge how "strong" the risk is over all possible values of $\theta$. But involving the prior somehow disturbs this intuition, even though it feels close. Can someone help me understand how to interpret the prior?
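For concreteness, here is a minimal numerical sketch of the difference between the two ideas (the normal model, the shrinkage estimator $\delta(x)=x/2$, the squared-error loss and the standard normal prior are all assumptions made purely for illustration): the raw area under the risk curve can be infinite, while the Bayes risk weights the curve by $\pi$ and averages it over the values of $\theta$ that the prior considers plausible.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative setup (an assumption, not from the question): x ~ N(theta, 1),
# shrinkage estimator delta(x) = x / 2, squared-error loss.  Its risk is
# R(theta, delta) = Var + bias^2 = 1/4 + theta^2 / 4.
def risk(theta):
    return 0.25 + 0.25 * theta**2

# The plain "area under the risk curve" over all of R is infinite, so it cannot
# rank estimators.  The Bayes risk instead weights the curve by the prior density.
prior = norm(loc=0, scale=1)   # pi(theta) = N(0, 1), chosen for illustration

bayes_risk, _ = quad(lambda t: risk(t) * prior.pdf(t), -np.inf, np.inf)
print(bayes_risk)              # ~0.5 = 1/4 + E_pi[theta^2] / 4
```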

Best Answer

[Here is an excerpt from my own textbook, The Bayesian Choice (2007), that argues in favour of a decision-theoretic approach to Bayesian analysis, hence of using the Bayes risk.]

Except for the most trivial settings, it is generally impossible to uniformly minimize (in $d$) the loss function $\text{L}(\theta,d)$ when $\theta$ is unknown. In order to derive an effective comparison criterion from the loss function, the frequentist approach proposes to consider instead the average loss (or frequentist risk) \begin{eqnarray*} R(\theta,\delta) & = & \mathbb{E}_\theta \lbrack \text{L} (\theta ,\delta(x))\rbrack \\ & = & \int_{\cal X} \text{L}(\theta,\delta(x))f(x|\theta) \,dx , \end{eqnarray*} where $\delta(x)$ is the decision rule, i.e., the allocation of a decision to each outcome $x\sim f(x|\theta)$ from the random experiment.
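As a sanity check of this definition, here is a small Monte Carlo sketch (the normal model, the sample-mean rule and the squared-error loss are illustrative assumptions, not part of the text): the loss is averaged over repeated draws of $x$ for a fixed $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: x = (x_1, ..., x_n) i.i.d. N(theta, sigma^2),
# decision rule delta(x) = sample mean, loss L(theta, d) = (theta - d)^2.
theta, sigma, n = 2.0, 1.0, 10

def delta(x):
    return x.mean()

# Monte Carlo approximation of R(theta, delta) = E_theta[L(theta, delta(x))]:
losses = [(theta - delta(rng.normal(theta, sigma, n)))**2 for _ in range(100_000)]
print(np.mean(losses))   # close to the exact risk sigma^2 / n = 0.1
```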

The function $\delta$, from ${\mathcal X}$ to $\mathfrak{D}$, is usually called an estimator (while the value $\delta(x)$ is called the estimate of $\theta$). When there is no risk of confusion, we also denote the set of estimators by $\mathfrak{D}$.

The frequentist paradigm relies on this criterion to compare estimators and, if possible, to select the best estimator, the reasoning being that estimators are evaluated on their long-run performance for all possible values of the parameter $\theta$. Notice, however, that there are several difficulties associated with this approach.

  1. The error (loss) is averaged over the different values of $x$ proportionally to the density $f(x|\theta)$. Therefore, it seems that the observation $x$ is not taken into account any further. The risk criterion evaluates procedures on their long-run performance and not directly for the given observation, $x$. Such an evaluation may be satisfactory for the statistician, but it is not so appealing for a client, who wants optimal results for her data $x$, not for somebody else's!
  2. The frequentist analysis of the decision problem implicitly assumes that this problem will be met again and again, for the frequency evaluation to make sense. Indeed, $R(\theta,\delta)$ is approximately the average loss over i.i.d. repetitions of the same experiment, according to the Law of Large Numbers. However, on both philosophical and practical grounds, there is a lot of controversy over the very notion of repeatability of experiments (see Jeffreys (1961)). For one thing, if new observations come to the statistician, she should make use of them, and this could modify the way the experiment is conducted, as in, for instance, medical trials.
  3. For a procedure $\delta$, the risk $R(\theta, \delta)$ is a function of the parameter $\theta$. Therefore, the frequentist approach does not induce a total ordering on the set of procedures. It is generally impossible to compare decision procedures with this criterion, since two crossing risk functions prevent comparison between the corresponding estimators. At best, one may hope for a procedure $\delta_0$ that uniformly minimizes $R(\theta,\delta)$, but such cases rarely occur unless the space of decision procedures is restricted. Best procedures can only be obtained by restricting rather artificially the set of authorized procedures. (A numerical sketch of two crossing risk curves follows this list.)
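To make the "crossing risk functions" point concrete, here is a minimal sketch under assumptions of my own choosing (normal observations, squared-error loss, the sample mean versus a shrinkage rule): neither estimator's risk curve lies below the other everywhere, so the frequentist criterion cannot rank them.

```python
import numpy as np

# Illustrative assumptions (not from the text): x_1, ..., x_n i.i.d. N(theta, sigma^2),
# squared-error loss, and two candidate rules:
#   delta_1(x) = sample mean,      risk R(theta, delta_1) = sigma^2 / n           (flat)
#   delta_2(x) = sample mean / 2,  risk R(theta, delta_2) = sigma^2/(4n) + theta^2/4
sigma, n = 1.0, 4
thetas = np.linspace(-3, 3, 601)

risk_1 = np.full_like(thetas, sigma**2 / n)
risk_2 = sigma**2 / (4 * n) + thetas**2 / 4

# The curves cross: delta_2 wins near theta = 0, delta_1 wins for large |theta|,
# so neither rule dominates and the risk alone gives no total ordering.
better_2 = thetas[risk_2 < risk_1]
print("delta_2 has smaller risk for theta in about", (better_2.min(), better_2.max()))
```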

Example 2.4 - Consider $x_1$ and $x_2$, two observations from $$ P_{\theta}(x = \theta-1) = P_{\theta}(x = \theta+1) = 0.5, \qquad \theta\in\mathbb{R}. $$ The parameter of interest is $\theta$ (i.e., $\mathfrak{D} = \Theta$) and it is estimated by estimators $\delta$ under the loss $$ \text{L}(\theta,\delta) = 1-\mathbb{I}_{\theta}(\delta), $$ often called $0-1$ loss, which penalizes errors of estimation, whatever their magnitude, by $1$. Considering the particular estimator $$ \delta_0(x_1,x_2) = {x_1+x_2 \over 2}, $$ its risk function is \begin{eqnarray*} R(\theta,\delta_0) & = & 1-P_{\theta}(\delta_0(x_1,x_2) = \theta) \\ & = & 1-P_{\theta}(x_1 \ne x_2) = 0.5. \end{eqnarray*} This computation shows that the estimator $\delta_0$ is correct half of the time. Actually, this estimator is always correct when $x_1\ne x_2$, and always wrong otherwise. Now, the estimator $\delta_1(x_1,x_2) = x_1+1$ also has a risk function equal to $0.5$, as does $\delta_2(x_1,x_2) = x_2-1$. Therefore, $\delta_0$, $\delta_1$ and $\delta_2$ cannot be ranked under the $0-1$ loss. $\blacktriangleright$
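A quick Monte Carlo check of this example (with $\theta = 0$ as an arbitrary, purely illustrative value) reproduces the conclusion that the three estimators share the same frequentist risk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Example 2.4, checked numerically with theta = 0 (an arbitrary illustrative choice):
# x_1 and x_2 are independent and equal to theta - 1 or theta + 1 with probability 1/2.
theta, reps = 0.0, 200_000
x1 = theta + rng.choice([-1.0, 1.0], size=reps)
x2 = theta + rng.choice([-1.0, 1.0], size=reps)

estimates = {
    "delta_0": (x1 + x2) / 2,   # midpoint of the two observations
    "delta_1": x1 + 1,          # shift the first observation up
    "delta_2": x2 - 1,          # shift the second observation down
}

# Under 0-1 loss the risk is the probability of missing theta exactly.
for name, d in estimates.items():
    print(name, np.mean(d != theta))   # each is approximately 0.5
```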

On the contrary, the Bayesian approach to Decision Theory integrates over the space $\Theta$, since $\theta$ is unknown, instead of integrating over the space ${\cal X}$, as $x$ is known. It relies on the posterior expected loss \begin{eqnarray*} \rho(\pi,d|x) & = & \mathbb{E}^\pi[L(\theta,d)|x] \\ & = & \int_{\Theta} \text{L}(\theta,d) \pi(\theta|x)\, d\theta, \end{eqnarray*} which averages the error (i.e., the loss) according to the posterior distribution of the parameter $\theta$, conditionally on the observed value $x$. Given $x$, the average error resulting from decision $d$ is actually $\rho(\pi,d|x)$. The posterior expected loss is thus a function of $x$, but this dependence is not troublesome, as opposed to the frequentist dependence of the risk on the parameter, because $x$, contrary to $\theta$, is known.
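Back on Example 2.4, the posterior expected loss does discriminate between the estimators once the data are in hand. Here is a small sketch assuming, purely for illustration, a flat prior over the values of $\theta$ compatible with the data: when $x_1 \ne x_2$ the posterior is a point mass at $(x_1+x_2)/2$, so $\delta_0$ incurs no posterior loss, while when $x_1 = x_2$ it is surely wrong and $\delta_1$ is right half of the time.

```python
import numpy as np

def posterior(x1, x2):
    """Posterior over theta in Example 2.4 under a flat prior (an assumption made
    purely for illustration): mass proportional to P(x1 | theta) P(x2 | theta),
    positive only for theta in {x1 - 1, x1 + 1} intersected with {x2 - 1, x2 + 1}."""
    support = sorted({x1 - 1, x1 + 1} & {x2 - 1, x2 + 1})
    return support, np.full(len(support), 1.0 / len(support))

def posterior_expected_loss(decision, x1, x2):
    # rho(pi, d | x) under 0-1 loss = posterior probability that theta differs from d.
    support, probs = posterior(x1, x2)
    return sum(p for t, p in zip(support, probs) if t != decision)

# x1 != x2: the posterior is a point mass at (x1 + x2) / 2.
print(posterior_expected_loss((3 + 5) / 2, 3, 5))   # 0.0 for delta_0
print(posterior_expected_loss(3 + 1, 3, 5))         # 0.0 here, but 1.0 when x2 = x1 - 2
# x1 == x2: theta is x1 - 1 or x1 + 1 with posterior probability 1/2 each.
print(posterior_expected_loss((3 + 3) / 2, 3, 3))   # 1.0: delta_0 is surely wrong
print(posterior_expected_loss(3 + 1, 3, 3))         # 0.5: delta_1 is right half the time
```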
