Solved – Under the 0-1 loss function, the Bayesian estimator is the mode of the posterior distribution

bayesian, calculus, probability

My notes are rather light when it comes to this topic. I understand that the Bayesian estimator, for an observed sample $\hat{x}$, is defined as:

$E[\theta \mid \hat{x}] = \int_{\Theta} y \, f_{\theta \mid \hat{x}}(y \mid \hat{x}) \, dy$ (i.e. the mean of the posterior distribution).

You can then look at loss functions, where the form of the loss function determines the form of $\theta^{'}$ (the estimator of $\theta$). Setting the loss function to quadratic, absolute error, or zero-one loss gives $\theta^{'}$ as the mean, the median, and the mode of the posterior respectively.
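For intuition, here is a minimal numerical sketch of these three results (the posterior here is just a made-up two-bump distribution on a grid, and the zero-one loss uses a small tolerance so that it makes sense on a grid):

```python
import numpy as np

# A made-up discretised posterior on a grid of theta values (illustration only).
theta_grid = np.linspace(-3, 3, 601)
posterior = np.exp(-0.5 * ((theta_grid - 0.7) / 0.6) ** 2) \
    + 0.4 * np.exp(-0.5 * ((theta_grid + 1.0) / 0.4) ** 2)
posterior /= posterior.sum()  # normalise to a probability mass function

def expected_loss(loss):
    """Expected posterior loss of every candidate estimate on the grid."""
    return np.array([(loss(t_hat, theta_grid) * posterior).sum()
                     for t_hat in theta_grid])

eps = 0.05  # tolerance used for the zero-one loss on a grid
quadratic = expected_loss(lambda t_hat, t: (t_hat - t) ** 2)
absolute  = expected_loss(lambda t_hat, t: np.abs(t_hat - t))
zero_one  = expected_loss(lambda t_hat, t: (np.abs(t_hat - t) > eps).astype(float))

print("quadratic minimiser (~ posterior mean):  ", theta_grid[quadratic.argmin()])
print("absolute minimiser  (~ posterior median):", theta_grid[absolute.argmin()])
print("zero-one minimiser  (~ posterior mode):  ", theta_grid[zero_one.argmin()])
```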

The first two proofs make sense to me but it is the 3rd that I am not sure about. This is my understanding:

$E[l] = \int l(\theta^{'}, \theta)f(\theta|\hat{x})\,d\theta = \int \mathbb{I}(\theta^{'} \neq \theta)f(\theta|\hat{x})\,d\theta $

If $\theta^{'} = \theta \Rightarrow E[l] = 0$

If $\theta^{'} \neq \theta \Rightarrow \int \mathbb{I}(\theta^{'} \neq \theta)f(\theta|\hat{x})\,d\theta = 1 $ (as integrating over the whole domain of $\theta$).

This proof obviously isn't rigorous (and it could be incorrect too). What I can't see, regardless, is how the resulting estimator can be regarded as the 'mode' of the posterior distribution.

Thanks in advance!

Best Answer

You need to be a bit careful with this kind of problem, because the definition of the zero-one loss function depends on whether you are dealing with a discrete or a continuous parameter. For a discrete parameter you can define the zero-one loss as an indicator function and this works fine. For a continuous parameter you can't do this: the event that the estimator exactly equals the parameter has probability zero under a continuous posterior density, so the expected loss would equal one regardless of the estimator, giving no basis for choosing between estimators. In the continuous case you therefore need to define the zero-one loss function either by allowing some "tolerance" around the exact value, or by using the Dirac delta function. Below I show the derivation of the posterior mode estimator in both the discrete and continuous cases, using the Dirac delta function in the latter. I also unify these cases by using Lebesgue-Stieltjes integration.
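As an aside, one way to formalise the "tolerance" version of the zero-one loss is:

$$L_{\varepsilon}(\hat{\theta}, \theta) = \mathbb{I}(|\hat{\theta} - \theta| > \varepsilon),$$

which has expected loss $1 - \int_{\hat{\theta}-\varepsilon}^{\hat{\theta}+\varepsilon} \pi(\theta|X) \, d\theta$; under mild regularity conditions on the posterior density, the minimiser of this expected loss converges to the posterior mode as $\varepsilon \rightarrow 0$. The derivations below use the Dirac delta approach instead.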


Discrete case: Suppose that the unknown parameter $\theta$ is a discrete random variable, and let $\hat{\theta}$ denote the estimator of this parameter. Then the zero-one loss function is defined as:

$$L(\hat{\theta} , \theta) = \mathbb{I}(\hat{\theta} \neq \theta).$$

This gives expected loss:

$$\begin{equation} \begin{aligned} \bar{L}(\hat{\theta} | X) \equiv \mathbb{E}(L(\hat{\theta}, \theta ) | X) &= \sum_{\theta \in \Theta} \mathbb{I}(\hat{\theta} \neq \theta) \pi (\theta | X ) \\[8pt] &= 1 - \sum_{\theta \in \Theta} \mathbb{I}(\hat{\theta} =\theta) \pi (\theta | X ) \\[8pt] &= 1 - \pi (\hat{\theta} | X). \end{aligned} \end{equation}$$

Minimising the expected loss is equivalent to maximising the posterior probability $\pi (\hat{\theta} | X)$, which occurs when $\hat{\theta}$ is the posterior mode.
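As a quick sanity check of this identity, here is a minimal numerical sketch (using a made-up posterior over five support points):

```python
import numpy as np

# Made-up discrete posterior over five support points (illustration only).
theta_values = np.array([0, 1, 2, 3, 4])
posterior = np.array([0.10, 0.15, 0.40, 0.25, 0.10])  # sums to one

# Expected zero-one loss for each candidate estimate theta_hat:
# sum over theta of I(theta_hat != theta) * pi(theta | X) = 1 - pi(theta_hat | X).
expected_loss = np.array([((theta_values != t_hat) * posterior).sum()
                          for t_hat in theta_values])

best_estimate = theta_values[expected_loss.argmin()]
posterior_mode = theta_values[posterior.argmax()]
print(np.allclose(expected_loss, 1 - posterior))  # True: matches the identity above
print(best_estimate == posterior_mode)            # True: the minimiser is the mode
```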


Continuous case: Suppose that the unknown parameter $\theta$ is a continuous random variable, and let $\hat{\theta}$ denote the estimator of this parameter. Then the zero-one loss function is defined as:

$$L(\hat{\theta} , \theta) = 1 - \delta (\hat{\theta} - \theta),$$

where $\delta$ denotes the Dirac delta function. This gives expected loss:

$$\begin{equation} \begin{aligned} \bar{L}(\hat{\theta} | X) \equiv \mathbb{E}(L(\hat{\theta}, \theta ) | X) &= \int_{\Theta} (1- \delta (\hat{\theta} - \theta)) \pi (\theta | X ) \ d \theta \\[8pt] &= 1 - \int_{\Theta} \delta (\hat{\theta} - \theta) \pi (\theta | X ) \ d \theta \\[8pt] &= 1 - \pi (\hat{\theta} | X). \end{aligned} \end{equation}$$

Minimising the expected loss is equivalent to maximising the posterior density $\pi (\hat{\theta} | X)$, which occurs when $\hat{\theta}$ is the posterior mode. Note here that the Dirac delta function is not strictly a real function; it is actually a distribution on the real line.
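To see the "tolerance" version of this result in action, here is a minimal numerical sketch (using a made-up Gamma posterior on a grid); the minimiser of the tolerance-based expected loss approaches the posterior mode as the tolerance shrinks:

```python
import numpy as np

# Made-up continuous posterior: a Gamma(3, 1) density,
# pi(theta | X) = theta^2 * exp(-theta) / 2, whose mode is at theta = 2.
theta_grid = np.linspace(0.0, 12.0, 4001)
dtheta = theta_grid[1] - theta_grid[0]
density = theta_grid ** 2 * np.exp(-theta_grid) / 2.0

def zero_one_minimiser(eps):
    """Grid minimiser of the tolerance version of the expected zero-one loss,
    1 - P(|theta_hat - theta| <= eps | X)."""
    window_prob = np.array([
        density[np.abs(theta_grid - t_hat) <= eps].sum() * dtheta
        for t_hat in theta_grid])
    return theta_grid[window_prob.argmax()]

for eps in (1.0, 0.3, 0.05):
    print(eps, zero_one_minimiser(eps))  # approaches the posterior mode (theta = 2)
```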


Unification with Lebesgue-Stieltjes integration: We can unify these two cases by writing the zero-one loss in terms of the step function:

$$H(\theta - \hat{\theta}) = \mathbb{I}(\theta \geqslant \hat{\theta}),$$

whose Lebesgue-Stieltjes measure $d H(\theta - \hat{\theta})$ places unit mass at the single point $\theta = \hat{\theta}$.

We can then write the expected loss as:

$$\begin{equation} \begin{aligned} \bar{L}(\hat{\theta} | X) \equiv \mathbb{E}(L(\hat{\theta}, \theta ) | X) &= 1 - \int_{\Theta} \pi (\theta | X ) \ d H(\theta-\hat{\theta}) \\[8pt] &= 1 - \pi (\hat{\theta} | X). \end{aligned} \end{equation}$$

This treatment encompasses both the discrete and continuous cases. In fact, it implicitly uses the Dirac delta function, since the step function $H$ is the distribution function of the unit point mass described by the Dirac delta.
