Bayesian Inference – What to Do with Bayesian Estimates at the End of the Day


I have often heard that in certain instances it can be more beneficial to use Bayesian methods because they provide "a distribution of possible answers" (i.e. the posterior distribution) instead of a single answer (as in the frequentist case). However, it seems that at the end of the day the analyst is still required to transform this "distribution of possible answers" into a single answer.

For example: if a Bayesian model is used to estimate the posterior distribution of $\mu$, the analyst is still required to take either the MAP (maximum a posteriori) estimate or the expectation of this distribution to return a final answer.

Is this the main benefit of Bayesian models: that, if the priors are correctly specified, the credible intervals associated with the expectation of the posterior distribution (of the parameter of interest) are more reliable?

Best Answer

First of all, frequentist methods also provide a distribution over possible answers; we just do not call them distributions, for a philosophical reason. Frequentists consider the parameters of a distribution to be fixed quantities. A parameter is not allowed to be random; therefore, you cannot talk about distributions over parameters in a meaningful way. In frequentist methods we estimate confidence intervals, which can be thought of as distributions if we let go of the philosophical details. In Bayesian methods, by contrast, the parameters are allowed to be random; therefore, we talk about (prior and posterior) distributions over the parameters.
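To make the contrast concrete, here is a minimal sketch comparing the two outputs for the mean of normal data. The model (known variance) and the weak normal prior are assumptions of this illustration, not anything from the question; with such a vague prior the two intervals nearly coincide numerically, yet only the credible interval may be read as a probability statement about $\mu$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)  # sigma = 1 treated as known
n, xbar, sigma = len(data), data.mean(), 1.0

# Frequentist: a 95% confidence interval for mu from the sampling distribution
z = stats.norm.ppf(0.975)
ci = (xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))

# Bayesian: with a Normal(mu0, tau0^2) prior, the posterior for mu is also Normal
mu0, tau0 = 0.0, 10.0  # a deliberately weak prior (assumption for this sketch)
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)
cred = stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))

print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% credible interval:   ({cred[0]:.3f}, {cred[1]:.3f})")
```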

Second, it is not always the case that only a single value is used at the end. Many applications require us to use the entire posterior distribution in subsequent analysis. In fact, even to derive a suitable point estimate, the full distribution is required. A well-known example is risk minimization, elaborated with a worked example below. Another example is model identification in the natural sciences in the presence of significant uncertainties.

Third, Bayesian inference has many benefits over a frequentist analysis (not just the one that you mention):

  1. Ease of interpretation: It is hard to understand what a confidence interval is and why it is not a probability distribution. The reason is simply a philosophical one, as explained briefly above. The probability distributions in Bayesian inference are easier to understand because that is how we typically tend to think in uncertain situations.

  2. Ease of implementation: It is easier to obtain Bayesian probability distributions than frequentist confidence intervals. A frequentist analysis requires us to identify a sampling distribution, which is very difficult in many real-world applications.

  3. Assumptions of the model are explicit in Bayesian inference: For example, many frequentist analyses assume asymptotic normality in order to compute the confidence interval; no such assumption is required for Bayesian inference. More generally, the assumptions made in a Bayesian analysis are stated explicitly, in the likelihood and the prior.

  4. Prior information: Most importantly, Bayesian inference allows us to incorporate prior knowledge into the analysis in a relatively simple manner (see the conjugate-prior sketch after this list). In frequentist methods, regularization is used to incorporate prior information, which is very difficult to do in many problems. This is not to say that incorporating prior information is easy in a Bayesian analysis, but it is easier than in a frequentist one.
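As promised in point 4, here is a minimal sketch of how prior information enters a Bayesian analysis via a conjugate Beta prior on a Binomial success probability. The counts and the prior parameters are made up for this illustration.

```python
from scipy import stats

# Suppose we observe 7 successes in 10 trials and want to estimate the
# success probability theta (numbers invented for this illustration).
successes, trials = 7, 10

# Prior knowledge, e.g. from an earlier study, encoded as a Beta(a, b) prior.
a_prior, b_prior = 2.0, 2.0  # assumption: mild prior belief centred on 0.5

# Conjugacy makes the update trivial: the posterior is Beta(a + s, b + n - s).
a_post = a_prior + successes
b_post = b_prior + trials - successes

posterior = stats.beta(a_post, b_post)
print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

The prior acts like a handful of pseudo-observations, which is why stronger prior beliefs correspond to larger values of $a$ and $b$.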

Edit: A particularly good example of the ease of interpretation of Bayesian methods is their use in probabilistic machine learning (ML). Several methods in the ML literature have been developed against the backdrop of Bayesian ideas, for example relevance vector machines (RVMs) and Gaussian processes (GPs).
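A GP regressor, for instance, returns a predictive mean and a predictive standard deviation at every test point, i.e. a distribution over answers rather than a single answer. A small sketch using scikit-learn's GaussianProcessRegressor follows; the toy data and kernel choice are assumptions of this illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=20)

# The GP returns a full predictive distribution, not just point predictions.
gp = GaussianProcessRegressor(kernel=RBF(), alpha=0.1**2).fit(X, y)
X_new = np.linspace(0, 5, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)  # predictive mean and uncertainty
for m, s in zip(mean, std):
    print(f"prediction: {m:.3f} +/- {1.96 * s:.3f}")
```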

As Richard Hardy pointed out, this answer gives the reasons why someone would want to use Bayesian analysis. There are good reasons to use frequentist analysis as well; in general, frequentist methods are computationally more efficient. I would suggest reading the first three or four chapters of 'Statistical Decision Theory and Bayesian Analysis' by James Berger, which gives a balanced view of this issue, albeit with an emphasis on Bayesian practice.

To elaborate on the use of the entire distribution rather than a point estimate to make a decision in risk minimization, a simple example follows. Suppose you have to choose a parameter value for a process in order to make a decision, and the cost of choosing a wrong value is $L(\hat{\theta},\theta)$, where $\hat{\theta}$ is the chosen estimate and $\theta$ is the true parameter. Given the posterior distribution $p(\theta|D)$ (where $D$ denotes the observations), we can compute the expected loss of each candidate estimate, $\int L(\hat{\theta},\theta)\,p(\theta|D)\,d\theta$, and choose the $\hat{\theta}$ that minimizes it. This results in a point estimate, but the value of that point estimate depends on the loss function, and computing it requires the full posterior.
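A small Monte Carlo sketch of this recipe follows. The gamma "posterior" here is a stand-in for whatever MCMC draws a real analysis would produce; the sketch also confirms the standard results that squared loss is minimized by the posterior mean and absolute loss by the posterior median.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a posterior: in practice these would be MCMC draws of theta.
posterior_draws = rng.gamma(shape=3.0, scale=2.0, size=10_000)

# Candidate point estimates theta_hat on a grid.
candidates = np.linspace(posterior_draws.min(), posterior_draws.max(), 500)

def expected_loss(loss):
    # Approximate the integral of L(theta_hat, theta) p(theta | D) dtheta
    # by a Monte Carlo average over the posterior draws.
    return np.array([loss(c, posterior_draws).mean() for c in candidates])

sq_loss = lambda est, theta: (est - theta) ** 2    # minimized by posterior mean
abs_loss = lambda est, theta: np.abs(est - theta)  # minimized by posterior median

print("argmin squared loss: ", candidates[expected_loss(sq_loss).argmin()])
print("posterior mean:      ", posterior_draws.mean())
print("argmin absolute loss:", candidates[expected_loss(abs_loss).argmin()])
print("posterior median:    ", np.median(posterior_draws))
```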

Based on a comment by Alexis, here is why frequentist confidence intervals are harder to interpret. A confidence interval is (as Alexis has pointed out) a plausible range of estimates for a parameter given a Type I error rate. One naturally asks where this plausible range comes from. The frequentist answer is that it comes from the sampling distribution. But then the question is: we only observe one sample, so where do the other samples come from? The frequentist answer is that we infer what other samples could have been observed based on the likelihood function. But if we are inferring other samples from the likelihood function, those samples have a probability distribution over them, and consequently the confidence interval should be interpretable as a probability distribution. For the philosophical reason mentioned above, however, this last extension from sampling distribution to confidence interval is not allowed. Compare this to the Bayesian statement: a 95% credible region means that the true parameter lies in this region with 95% probability.
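The repeated-sampling interpretation can be made concrete with a simulation (a normal model with known variance is assumed here): across many hypothetical samples, roughly 95% of the computed intervals cover the true parameter, but no individual interval carries a probability statement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_true, sigma, n, reps = 2.0, 1.0, 30, 10_000
z = stats.norm.ppf(0.975)

covered = 0
for _ in range(reps):
    sample = rng.normal(mu_true, sigma, size=n)
    half = z * sigma / np.sqrt(n)
    if sample.mean() - half <= mu_true <= sample.mean() + half:
        covered += 1

# Roughly 95% of the intervals cover mu_true; the probability statement
# belongs to the long-run procedure, not to any single realized interval.
print(f"coverage over {reps} repetitions: {covered / reps:.3f}")
```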

A side note on the philosophical differences between Bayesian and frequentist theory (based on a comment): In frequentist theory, the probability of an event is the relative frequency of that event over a large number of repeated trials of the experiment in question. Therefore, the parameters of a distribution are fixed, because they stay the same in all repetitions of the experiment. In Bayesian theory, probabilities are degrees of belief that an event will occur in a single trial of the experiment in question. The problem with the frequentist definition of probability is that it cannot be used to define probabilities in many real-world applications. As an example, try to define the probability that I am typing this answer on an Android smartphone. A frequentist would have to say that the probability is either $0$ or $1$, while the Bayesian definition allows you to choose an appropriate number between $0$ and $1$.