Bayesian Credible Intervals – Decision-Theoretic Justification of Procedures

bayesian, credible-interval, decision-theory

(To see why I wrote this, check the comments below my answer to this question.)

Type III errors and statistical decision theory

Giving the right answer to the wrong question is sometimes called a Type III error. Statistical decision theory is a formalization of decision-making under uncertainty; it provides a conceptual framework that can help one avoid Type III errors. The key element of the framework is called the loss function. It takes two arguments: the first is (the relevant subset of) the true state of the world (e.g., in parameter estimation problems, the true parameter value $\theta$); the second is an element of the set of possible actions (e.g., in parameter estimation problems, the estimate $\hat{\theta}$). The output models the loss associated with every possible action with respect to every possible true state of the world. For example, in parameter estimation problems, some well-known loss functions (sketched in code after the list) are:

  • the absolute error loss $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$
  • the squared error loss $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$
  • Hal Varian's LINEX loss $L(\theta, \hat{\theta}; k) = \exp(k(\theta - \hat{\theta})) - k(\theta - \hat{\theta}) - 1, \quad k \ne 0$
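
As a concrete illustration, here is a minimal sketch of these three loss functions in Python (the function and parameter names are mine, not standard):

```python
import numpy as np

def absolute_error_loss(theta, theta_hat):
    """Absolute error loss: |theta - theta_hat|."""
    return np.abs(theta - theta_hat)

def squared_error_loss(theta, theta_hat):
    """Squared error loss: (theta - theta_hat)**2."""
    return (theta - theta_hat) ** 2

def linex_loss(theta, theta_hat, k):
    """Varian's LINEX loss: asymmetric; k != 0 sets the direction and degree of asymmetry."""
    d = k * (theta - theta_hat)
    return np.exp(d) - d - 1.0
```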

Examining the answer to find the question

One might attempt to make the case that Type III errors can be avoided by focusing on formulating a correct loss function and proceeding through the rest of the decision-theoretic approach (not detailed here). That's not my brief; after all, statisticians are well equipped with many techniques and methods that work well even though they are not derived from such an approach. But the end result, it seems to me, is that the vast majority of statisticians don't know and don't care about statistical decision theory, and I think they're missing out. To those statisticians, I would argue that the reason they might find statistical decision theory valuable in terms of avoiding Type III errors is that it provides a framework in which to ask of any proposed data analysis procedure: which loss function (if any) does the procedure handle optimally? That is, in what decision-making situation, exactly, does it provide the best answer?

Posterior expected loss

From a Bayesian perspective, the loss function is all we need. We can pretty much skip the rest of decision theory: almost by definition, the best thing to do is to minimize the posterior expected loss, that is, find the action $a$ that minimizes $\tilde{L}(a) = \int_{\Theta}L(\theta, a)\,p(\theta|D)\,d\theta$.
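
In practice one often has posterior draws rather than a closed-form posterior, in which case $\tilde{L}(a)$ can be approximated by a Monte Carlo average and minimized numerically. A minimal sketch, assuming the posterior is represented by samples (the Gamma draws below are just a stand-in):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
posterior_draws = rng.gamma(shape=3.0, scale=2.0, size=100_000)  # stand-in posterior

def posterior_expected_loss(a, loss, draws):
    """Monte Carlo estimate of the posterior expected loss of action a."""
    return loss(draws, a).mean()

# Bayes action under squared error loss; it should match the posterior mean.
result = minimize_scalar(posterior_expected_loss,
                         args=(lambda theta, a: (theta - a) ** 2, posterior_draws))
print(result.x, posterior_draws.mean())  # nearly identical
```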

(And as for non-Bayesian perspectives? Well, it is a theorem of frequentist decision theory, specifically Wald's Complete Class Theorem, that the optimal action will always be to minimize Bayesian posterior expected loss with respect to some (possibly improper) prior. The difficulty with this result is that it is an existence theorem: it gives no guidance as to which prior to use. But it fruitfully restricts the class of procedures that we can "invert" to figure out exactly which question it is that we're answering. In particular, the first step in inverting any non-Bayesian procedure is to figure out which (if any) Bayesian procedure it replicates or approximates.)

Hey Cyan, you know this is a Q&A site, right?

Which brings me, finally, to a statistical question. In Bayesian statistics, when providing interval estimates for univariate parameters, two common credible interval procedures are the quantile-based credible interval and the highest posterior density (HPD) credible interval. What are the loss functions behind these procedures?

Best Answer

In univariate interval estimation, the set of possible actions is the set of ordered pairs specifying the endpoints of the interval. Let an element of that set be represented by $(a, b)$ with $a \le b$.

Highest posterior density intervals

Let the posterior density be $f(\theta)$. Highest posterior density intervals correspond to a loss function that penalizes an interval for failing to contain the true value and also penalizes intervals in proportion to their length:

$L_{HPD}(\theta, (a, b); k) = I(\theta \notin [a, b]) + k(b - a), \quad 0 < k \le \max_{\theta} f(\theta)$,

where $I(\cdot)$ is the indicator function. This gives the posterior expected loss

$\tilde{L}_{HPD}((a, b); k) = 1 - \Pr(a \le \theta \le b|D) + k(b - a)$.

Setting $\frac{\partial}{\partial a}\tilde{L}_{HPD} = \frac{\partial}{\partial b}\tilde{L}_{HPD} = 0$ (the two partial derivatives are $f(a) - k$ and $k - f(b)$, respectively) yields the necessary condition for a local optimum in the interior of the parameter space: $f(a) = f(b) = k$. This is exactly the rule for HPD intervals, as expected.
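
To see this rule in action, here is a sketch that computes an HPD interval for a unimodal posterior by searching for the shortest interval with fixed coverage (an equivalent characterization for unimodal densities); the Gamma posterior is only an illustrative stand-in:

```python
from scipy import stats
from scipy.optimize import minimize_scalar

posterior = stats.gamma(a=3.0, scale=2.0)  # stand-in unimodal posterior
coverage = 0.95

def interval_length(lower):
    """Length of the interval [lower, upper] holding `coverage` posterior mass."""
    upper = posterior.ppf(posterior.cdf(lower) + coverage)
    return upper - lower

# The shortest interval with fixed coverage is the HPD interval for a unimodal density.
opt = minimize_scalar(interval_length, method='bounded',
                      bounds=(1e-9, posterior.ppf(1 - coverage) - 1e-9))
a = opt.x
b = posterior.ppf(posterior.cdf(a) + coverage)
print("HPD interval:", (a, b))
print("f(a), f(b):", posterior.pdf(a), posterior.pdf(b))  # approximately equal: f(a) = f(b) = k
```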

The form of $\tilde{L}_{HPD}((a, b); k)$ gives some insight into why HPD intervals are not invariant to a monotone increasing transformation $g(\theta)$ of the parameter. The $\theta$-space HPD interval transformed into $g(\theta)$ space is different from the $g(\theta)$-space HPD interval because the two intervals correspond to different loss functions: the $g(\theta)$-space HPD interval corresponds to a transformed length penalty $k(g(b) - g(a))$.
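
A quick numerical illustration of this non-invariance, using a lognormal stand-in posterior so that the log-scale HPD interval is available in closed form:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

# Stand-in posterior: theta ~ LogNormal(0, 1), so g(theta) = log(theta) ~ Normal(0, 1).
theta_post = stats.lognorm(s=1.0)
coverage = 0.95

def shortest_interval(dist, coverage, lo):
    """Shortest interval with the given mass: the HPD interval for a unimodal density."""
    length = lambda lower: dist.ppf(dist.cdf(lower) + coverage) - lower
    opt = minimize_scalar(length, method='bounded',
                          bounds=(lo, dist.ppf(1 - coverage) - 1e-9))
    a = opt.x
    return a, dist.ppf(dist.cdf(a) + coverage)

a, b = shortest_interval(theta_post, coverage, 1e-9)
z = stats.norm.ppf(0.5 + coverage / 2)  # log-space HPD is symmetric about 0
print("theta-space HPD mapped to log scale:", (np.log(a), np.log(b)))
print("log-space HPD:", (-z, z))  # a different interval
```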

Quantile-based credible intervals

Consider point estimation with the loss function (sometimes called the check or pinball loss)

$L_q(\theta, \hat{\theta}; p) = (1-p)(\hat{\theta} - \theta)I(\theta < \hat{\theta}) + p(\theta - \hat{\theta})I(\theta \ge \hat{\theta}), \quad 0 \le p \le 1$.

The posterior expected loss is

$\tilde{L}_q(\hat{\theta};p)=(1-p)\Pr(\theta < \hat{\theta}|D)\big(\hat{\theta}-\text{E}(\theta|\theta < \hat{\theta}, D)\big) + p\Pr(\theta \ge \hat{\theta}|D)\big(\text{E}(\theta | \theta \ge \hat{\theta}, D)-\hat{\theta}\big)$.

Setting $\frac{d}{d\hat{\theta}}\tilde{L}_q=0$ (the derivative is $(1-p)\Pr(\theta < \hat{\theta}|D) - p\Pr(\theta \ge \hat{\theta}|D)$) yields the implicit equation

$\Pr(\theta < \hat{\theta}|D) = p$,

that is, the optimal $\hat{\theta}$ is the $100p\%$ quantile of the posterior distribution, as expected.
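
A quick numerical check that minimizing this loss over posterior draws recovers the posterior quantile (a sketch; the Gamma draws are a stand-in):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
draws = rng.gamma(shape=3.0, scale=2.0, size=200_000)  # stand-in posterior draws
p = 0.25

def quantile_loss(theta, theta_hat, p):
    """L_q: (1 - p) penalty per unit of overestimation, p per unit of underestimation."""
    return np.where(theta < theta_hat,
                    (1 - p) * (theta_hat - theta),
                    p * (theta - theta_hat))

opt = minimize_scalar(lambda t: quantile_loss(draws, t, p).mean())
print(opt.x, np.quantile(draws, p))  # should nearly coincide
```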

Thus, to obtain quantile-based interval estimates, the loss function is the sum of two such point-estimation losses, one for each endpoint:

$L_{qCI}(\theta, (a,b); p_L, p_U) = L_q(\theta, a;p_L) + L_q(\theta, b;p_U)$.
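
The two endpoint losses decouple, so the optimal interval is just a pair of posterior quantiles; for a central 95% interval, $p_L = 0.025$ and $p_U = 0.975$. From posterior draws (a sketch with stand-in draws):

```python
import numpy as np

rng = np.random.default_rng(2)
draws = rng.gamma(shape=3.0, scale=2.0, size=200_000)  # stand-in posterior draws

# Central 95% quantile-based credible interval: p_L = 0.025, p_U = 0.975.
a, b = np.quantile(draws, [0.025, 0.975])
print("quantile-based interval:", (a, b))
```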
