Bayesian Inference – Difference Between Risk Function in Bayesian Inference and Supervised Learning

bayesian, loss-functions, risk, supervised learning

In the context of Bayesian inference, given

  • the random parameter $\Theta$,
  • the observed data $\mathcal{D} = \{x_1,x_2,\dots,x_N\}$,
  • the posterior $p(\theta\mid \mathcal{D})$,
  • the estimator $\hat\theta(\mathcal{D})$,
  • and the loss function $L\left(\hat\theta(\mathcal{D}),\Theta\right)$,

the Bayesian risk function is defined as
$$
R_B\left(\hat\theta,\mathcal{D}\right) = \mathbb{E}_{p(\theta\mid\mathcal{D})}\left[L\left(\hat\theta(\mathcal{D}),\Theta\right) \right] = \int_\theta L\left(\hat\theta(\mathcal{D}),\theta\right) \cdot p(\theta\mid\mathcal{D}) \ \text{d}\theta
$$
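As a concrete illustration of this definition, here is a minimal sketch under assumptions that are not part of the question: a Bernoulli likelihood, a Beta(1, 1) prior on $\Theta$, and squared-error loss. With these choices the posterior is a Beta distribution, and the integral reduces to the posterior variance plus the squared distance between the estimate and the posterior mean. The function names and the data are made up for the example.

```python
def posterior_params(data, a=1.0, b=1.0):
    """Beta posterior parameters for Bernoulli data under a Beta(a, b) prior."""
    k = sum(data)          # number of successes
    n = len(data)          # number of observations
    return a + k, b + n - k


def posterior_expected_loss(estimate, data, a=1.0, b=1.0):
    """R_B(estimate, D) under squared-error loss (an assumed loss).

    E[(estimate - Theta)^2 | D] = Var(Theta | D) + (estimate - E[Theta | D])^2
    """
    alpha, beta = posterior_params(data, a, b)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return var + (estimate - mean) ** 2


# Hypothetical observed dataset D of N = 8 coin flips.
data = [1, 0, 1, 1, 0, 1, 1, 1]

mle = sum(data) / len(data)              # one possible estimator
alpha, beta = posterior_params(data)
post_mean = alpha / (alpha + beta)       # another estimator: the posterior mean

risk_mle = posterior_expected_loss(mle, data)
risk_post_mean = posterior_expected_loss(post_mean, data)

print(f"R_B(MLE, D)            = {risk_mle:.5f}")
print(f"R_B(posterior mean, D) = {risk_post_mean:.5f}")
```

Under squared-error loss the posterior mean minimizes this quantity for the given $\mathcal{D}$, so its risk should not exceed that of any other estimate.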

In contrast, in the context of supervised learning, given

  • the joint distribution $p(x,y)$,
  • the hypothesis $h(x)$,
  • and the loss function $L\left(h(x),y\right)$,

the supervised learning risk function is defined as
$$
R_{SL}\left(h\right) = \mathbb{E}_{p(x,y)}\left[L\left(h(X),Y\right) \right] = \int_x \int_y L\left(h(x),y\right) \cdot p(x,y) \ \text{d}x \ \text{d}y
$$
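For comparison, this risk can be approximated by Monte Carlo when one can sample from $p(x,y)$. The sketch below assumes a toy joint distribution that is not in the question: $X \sim \text{Uniform}(0,1)$, $Y = 2X + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 0.5^2)$, the hypothesis $h(x) = 2x$, and squared-error loss. For this setup the exact risk is the noise variance, $0.25$.

```python
import random

NOISE_STD = 0.5  # assumed noise level for the toy joint distribution


def sample_xy(rng):
    """Draw one (x, y) pair from the assumed joint distribution p(x, y)."""
    x = rng.random()                          # X ~ Uniform(0, 1)
    y = 2.0 * x + rng.gauss(0.0, NOISE_STD)   # Y = 2X + Gaussian noise
    return x, y


def h(x):
    """Hypothesis under evaluation (here it equals the true regression function)."""
    return 2.0 * x


def squared_loss(prediction, target):
    return (prediction - target) ** 2


def monte_carlo_risk(hypothesis, num_samples, rng):
    """Approximate R_SL(h) = E_{p(x,y)}[L(h(X), Y)] by a sample average."""
    total = 0.0
    for _ in range(num_samples):
        x, y = sample_xy(rng)
        total += squared_loss(hypothesis(x), y)
    return total / num_samples


rng = random.Random(0)  # fixed seed so the run is reproducible
estimated_risk = monte_carlo_risk(h, 100_000, rng)
print(f"Monte Carlo estimate of R_SL(h): {estimated_risk:.4f}")
```

Note that the expectation here runs over both $X$ and $Y$, whereas $R_B$ above conditions on the observed data and averages only over $\Theta$.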

Is there a relationship between $R_B\left(\hat\theta,\mathcal{D}\right)$ and $R_{SL}\left(h\right)$?

Best Answer

If we let $Y := \Theta$ and $h := \hat\theta$, then $R_{SL}(h)$ becomes
$$
R_{SL}(\hat\theta) = \mathbb{E}_{p(x,\theta)}\left[L\left(\hat\theta(X),\Theta\right) \right]
$$

Using the law of total expectation,
\begin{align}
R_{SL}(\hat\theta) &= \mathbb{E}_{p(x,\theta)}\left[L\left(\hat\theta(X),\Theta\right) \right]\\
&= \mathbb{E}_{p(x)}\left[\mathbb{E}_{p(\theta\mid x)}\left[L\left(\hat\theta(X),\Theta\right) \right] \right]
\end{align}

If we further take the input to be the whole dataset, $X := (X_1,X_2,\dots,X_N)$, so that a realization of $X$ is $\mathcal{D}$, then
\begin{align}
R_{SL}(\hat\theta) &= \mathbb{E}_{p(x)}\left[\mathbb{E}_{p(\theta\mid x)}\left[L\left(\hat\theta(X),\Theta\right) \right] \right] \\
&= \mathbb{E}_{p(\mathcal{D})}\left[\mathbb{E}_{p(\theta\mid \mathcal{D})}\left[L\left(\hat\theta(\mathcal{D}),\Theta\right) \right] \right]
\end{align}

Note that the inner expectation is exactly the Bayesian risk,
$$
\mathbb{E}_{p(\theta\mid \mathcal{D})}\left[L\left(\hat\theta(\mathcal{D}),\Theta\right) \right] = R_B(\hat\theta,\mathcal{D})
$$
and so
$$
R_{SL}(\hat\theta) = \mathbb{E}_{p(\mathcal{D})}\left[R_B(\hat\theta,\mathcal{D})\right]
$$

This means that $R_{SL}(\hat\theta)$ is just the Bayesian risk averaged over all possible datasets $\mathcal{D}$, weighted by the marginal distribution $p(\mathcal{D})$.
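The identity $R_{SL}(\hat\theta) = \mathbb{E}_{p(\mathcal{D})}\left[R_B(\hat\theta,\mathcal{D})\right]$ can be checked numerically. The sketch below is only an illustration under assumptions that are not in the question or the answer: $\Theta \sim \text{Beta}(1,1)$, $X_i \mid \Theta \sim \text{Bernoulli}(\Theta)$ with $N = 5$, squared-error loss, and the posterior mean as $\hat\theta$. The left-hand side is estimated by averaging the loss over joint draws of $(\Theta, \mathcal{D})$; the right-hand side by averaging the closed-form posterior expected loss over the same datasets.

```python
import random

A, B = 1.0, 1.0   # Beta prior hyperparameters (assumed for illustration)
N = 5             # dataset size (assumed)


def estimator(data):
    """theta_hat(D): posterior mean under the Beta(A, B) prior."""
    return (A + sum(data)) / (A + B + len(data))


def bayes_risk_given_data(data):
    """R_B(theta_hat, D) under squared-error loss, in closed form.

    E[(theta_hat - Theta)^2 | D] = Var(Theta | D) + (theta_hat - E[Theta | D])^2
    """
    alpha = A + sum(data)
    beta = B + len(data) - sum(data)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return var + (estimator(data) - mean) ** 2


rng = random.Random(0)   # fixed seed for reproducibility
num_draws = 100_000
loss_total = 0.0         # accumulates L(theta_hat(D), theta) over joint draws
risk_total = 0.0         # accumulates R_B(theta_hat, D) over datasets

for _ in range(num_draws):
    theta = rng.betavariate(A, B)                                # theta ~ prior
    data = [1 if rng.random() < theta else 0 for _ in range(N)]  # D | theta
    loss_total += (estimator(data) - theta) ** 2
    risk_total += bayes_risk_given_data(data)

r_sl = loss_total / num_draws       # estimate of E_{p(D, theta)}[L]
avg_r_b = risk_total / num_draws    # estimate of E_{p(D)}[R_B(theta_hat, D)]

print(f"R_SL(theta_hat)          ~ {r_sl:.5f}")
print(f"E_D[R_B(theta_hat, D)]   ~ {avg_r_b:.5f}")
```

The two printed values should agree up to Monte Carlo error. For this particular setup both approximate $1/42 \approx 0.0238$, which follows from averaging the posterior variance over the uniform marginal distribution of the number of successes.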
