Decision Theory – Understanding Why Bayes Risk is Not Connected to Observed Data

bayes-risk, decision-theory

It puzzles me that the Bayes risk seems not connected to the observed data. Let me illustrate this with an example. Let a coin toss follow a Bernoulli distribution with a hidden parameter $\theta$ and let the prior of $\theta$ be a uniform distribution. For some loss function, $L(\theta, \delta(X))$, the Bayes risk is:
$$
\mathbb{E}\big[\mathbb{E}[L(\theta, \delta(X)) \mid \theta]\big],
$$
where the inner expectation is taken over $P(X \mid \theta)$ and the outer expectation is taken over the prior distribution of $\theta$.

From a sampling perspective, to compute the Bayes risk we can first sample some $\theta$s from the uniform prior, then, for each sampled $\theta$, sample some coin tosses $X$ from the Bernoulli distribution given that $\theta$, and finally compute $L(\theta, \delta(X))$ for each sampled pair $(\theta, X)$ and take the average, as sketched below.
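
To make this concrete, here is a minimal Monte Carlo sketch. The loss function, the number of tosses, and the decision rule are not fixed by the question, so this assumes squared-error loss, $n = 10$ tosses per draw of $\theta$, and $\delta(X)$ equal to the posterior mean under the uniform prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumptions for this sketch (none of these are fixed by the question):
#   * squared-error loss  L(theta, d) = (theta - d)^2
#   * n = 10 tosses are observed for each draw of theta
#   * delta(X) is the posterior mean under the uniform (Beta(1, 1)) prior,
#     i.e. delta(X) = (number of heads + 1) / (n + 2)
n = 10
n_sims = 200_000

theta = rng.uniform(0.0, 1.0, size=n_sims)   # outer sampling: theta ~ prior
heads = rng.binomial(n, theta)               # inner sampling: X | theta ~ Binomial(n, theta)
delta = (heads + 1) / (n + 2)                # decision rule applied to each simulated X
loss = (theta - delta) ** 2                  # loss for each sampled pair (theta, X)

print(f"Monte Carlo Bayes risk: {loss.mean():.5f}")
# The exact Bayes risk of this particular rule is 1 / (6 * (n + 2)) = 1/72 ≈ 0.0139,
# so the estimate should land close to that value.
```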

What is interesting is that none of this computation has anything to do with observed data. Perhaps the observed data come from a fair coin, and the posterior given that data will tell us it is highly unlikely for $\theta$ to satisfy $\theta<0.2$ or $\theta>0.8$; nevertheless, the above computation will still assign 40% probability mass to $\theta<0.2$ or $\theta>0.8$ (the mass these regions receive under the uniform prior).
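
A quick numerical check of this contrast, assuming a hypothetical observed sample of 100 fair-coin tosses with 50 heads (so the posterior under the uniform prior is Beta(51, 51)):

```python
from scipy.stats import beta

# Hypothetical observed sample: 100 tosses of a fair coin, 50 of them heads.
heads, n = 50, 100

# Mass of {theta < 0.2} or {theta > 0.8} under the uniform prior Beta(1, 1)
# and under the posterior Beta(heads + 1, n - heads + 1).
prior_mass = beta.cdf(0.2, 1, 1) + beta.sf(0.8, 1, 1)
post_mass = beta.cdf(0.2, heads + 1, n - heads + 1) + beta.sf(0.8, heads + 1, n - heads + 1)

print(f"prior mass:     {prior_mass:.3f}")   # 0.400
print(f"posterior mass: {post_mass:.2e}")    # essentially zero
```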

Another way to look at this problem is through the following identity:

$$
\mathbb{E}\big[\mathbb{E}[L(\theta, \delta(X)) \mid \theta]\big] = \mathbb{E}\big[\mathbb{E}[L(\theta, \delta(X)) \mid X]\big]
$$
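
Both sides are simply the expectation of the loss under the joint distribution of $(\theta, X)$, by the law of total expectation applied twice:

$$
\mathbb{E}\big[\mathbb{E}[L(\theta, \delta(X)) \mid \theta]\big]
= \mathbb{E}\big[L(\theta, \delta(X))\big]
= \mathbb{E}\big[\mathbb{E}[L(\theta, \delta(X)) \mid X]\big].
$$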

On the right-hand side, the inner expectation is the posterior expected loss, $\mathbb{E}[L(\theta, \delta(X)) \mid X]$. It is taken over the posterior distribution of $\theta$ given $X$. This is all fine, as the "observed" data is considered. However, the outer expectation is taken over the marginal, unconditional distribution of $X$:

$$
P(X \in A) = \int P(X \in A \mid \theta)\, dP(\theta)
$$

In our example, $P(X|\theta)$ would be Bernoulli and $P(\theta)$ would be uniform.

This distribution of $X$ clearly has nothing to do with the distribution of the observed $X$: the observed data may come from a fair coin, while this marginal distribution can be quite different. Yet it is this distribution that is used to weight the posterior expected loss, which suggests it may place high probability mass on regions of $X$ values that do not really occur in reality, as the comparison below illustrates.
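
For the coin example, the marginal (prior predictive) distribution of the number of heads in $n$ tosses under the uniform prior is the discrete uniform on $\{0, \dots, n\}$, whereas data from a fair coin concentrate near $n/2$. A small comparison, assuming $n = 10$:

```python
import numpy as np
from scipy.stats import binom

n = 10

# Marginal (prior predictive) distribution of the number of heads under the
# uniform prior: integrating Binomial(n, theta) over theta in [0, 1] gives
# the discrete uniform distribution on {0, 1, ..., n}.
marginal = np.full(n + 1, 1.0 / (n + 1))

# Distribution of the number of heads when the data really come from a fair coin.
fair = binom.pmf(np.arange(n + 1), n, 0.5)

for k in range(n + 1):
    print(f"heads={k:2d}   marginal={marginal[k]:.3f}   fair coin={fair[k]:.3f}")
```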

My question is therefore: why would we want to compute the Bayes risk using the prior distribution? What is the intuition behind it? Is it possible to take the expectation of $\mathbb{E}[L(\theta, \delta(X)) \mid \theta]$ over the posterior instead?

Best Answer

The Bayes risk is used to attach a single number to a decision procedure $\delta(\cdot)$, hence to rank all procedures under a given prior, and therefore to find an optimal Bayesian procedure. In the non-Bayesian or frequentist setting, the risk $$\mathbb E_\theta[L(\theta,\delta(X))]$$ is a function of $\theta$, which leaves procedures non-ordered in most settings and thus prevents the derivation of an optimal frequentist procedure, unless further restrictions are imposed on the class of procedures. The sketch below illustrates this non-ordering.
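
A minimal sketch of this point for the coin example, again assuming squared-error loss (which the answer does not fix), comparing the frequentist risk functions of two rules that neither dominates the other:

```python
import numpy as np

# Frequentist risk functions R(theta, delta) = E_theta[(theta - delta(X))^2]
# for X ~ Binomial(n, theta) and two rules (squared-error loss assumed here):
#   delta_1(X) = X / n                the maximum-likelihood estimator
#   delta_2(X) = (X + 1) / (n + 2)    the posterior mean under the uniform prior
n = 10
theta = np.linspace(0.05, 0.95, 7)

risk_mle = theta * (1 - theta) / n                                   # variance of X/n
risk_shrunk = (n * theta * (1 - theta) + (1 - 2 * theta) ** 2) / (n + 2) ** 2

for t, r1, r2 in zip(theta, risk_mle, risk_shrunk):
    print(f"theta={t:.2f}   R(MLE)={r1:.4f}   R(shrunk)={r2:.4f}")
# Neither risk function lies below the other for every theta, so the two rules
# cannot be ranked without an extra criterion (a prior, minimaxity, ...).
```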

Conditioning upon the observed data $x$ leads to the posterior expected loss $$\mathbb E[L(\theta,\delta(X))|X=x]=\mathbb E[L(\theta,\delta(x))|X=x]$$ where the error is integrated out in $\theta$ wrt the posterior distribution of $\theta$ given $X=x$. Once again, this quantity is a real number for a given realisation $X=x$, which allows for the comparison of all possible values of $\delta(x)$ [varying in $\delta$, not in $x$] and thus for the derivation of the optimal Bayesian decision value $\delta^\pi(x)$. The optimal Bayesian decision procedure thus associates with each possible realisation $x$ of $X$ the decision value $\delta^\pi(x)$, meaning it is feasible to always reach the optimal decision.
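
As an illustration of this minimisation in $\delta(x)$, under the same hypothetical squared-error loss and uniform prior as above, the posterior expected loss is minimised at the posterior mean:

```python
import numpy as np

# Posterior expected loss rho(d, x) over a grid of candidate decisions d,
# assuming squared-error loss and the uniform (Beta(1, 1)) prior.
# After observing `heads` heads in `n` tosses the posterior is
# Beta(heads + 1, n - heads + 1).
n, heads = 10, 7
a, b = heads + 1, n - heads + 1
post_mean = a / (a + b)
post_var = a * b / ((a + b) ** 2 * (a + b + 1))

# For squared error, rho(d, x) = Var(theta | x) + (d - E[theta | x])^2.
d_grid = np.linspace(0.0, 1.0, 101)
rho = post_var + (d_grid - post_mean) ** 2

best = d_grid[np.argmin(rho)]
print(f"posterior mean = {post_mean:.3f}, arg min of rho(., x) ≈ {best:.3f}")
```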

That the marginal distribution $m(\cdot)$ is involved in the Bayes risk is of no particular concern and a consequence of the identity $$\pi(\theta) f(x|\theta) = m(x) \pi(\theta|x).$$ The Bayesian decision $\delta^\pi(x)$ depends on the actual observation $x$ and minimises the posterior loss $\rho(d,x)$. The Bayes risk is the marginal average of the errors (posterior losses) across all possible realisations of $X$, with theoretical uses in admissibility and minimaxity theorems, but it is not used as such to reach the optimal decision.
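
Written out, this is the decomposition
$$
r(\pi, \delta) = \int\!\!\int L(\theta, \delta(x))\, f(x\mid\theta)\,\pi(\theta)\, \mathrm{d}x\, \mathrm{d}\theta
= \int \Big[\underbrace{\int L(\theta, \delta(x))\,\pi(\theta\mid x)\, \mathrm{d}\theta}_{\rho(\delta(x),\,x)}\Big]\, m(x)\, \mathrm{d}x,
$$
so minimising the posterior loss $\rho(\delta(x), x)$ separately for each $x$ also minimises the Bayes risk, whatever weight $m(\cdot)$ assigns to the different realisations.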
