Unfortunately, the interval that you are looking for is not uniquely determined. Essentially, what you need is the Posterior Predictive Density (PPD, see https://en.wikipedia.org/wiki/Posterior_predictive_distribution), which is the density function of new/unseen data given the observed data. This PPD depends on the posterior distribution of the parameters, $\theta_1, ..., \theta_n$ in your case. It can be written as
$p(y^* | y, x, x^*) = \int p(y^*, \theta | y, x, x^*) d\theta = \int p(y^* | \theta, x^*) p(\theta | y, x, x^*) d\theta$
where $y^*$ represents the unseen response data, $y$ represents the known response data, $x$ and $x^*$ represent the predictor values that correspond to $y$ and $y^*$, and $\theta$ represents the parameters. The last factor in the final integral is the posterior distribution of $\theta$ given $y, x, x^*$. The PPD therefore depends on the posterior distribution of the parameters, which in turn depends on the prior distribution of the parameters (and on the chosen data model / likelihood function). This means that every prior you may choose gives a different posterior distribution (and, as a result, a different interval).
Usually, when you choose completely uninformative (i.e. flat) priors, along with a Normal likelihood for the response values given the predictors, the results of a Bayesian analysis coincide numerically with those of a frequentist analysis. Then again, flat priors are usually a poor choice for such a model.
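To make the flat-prior remark concrete, here is a minimal sketch for the simplest case I can think of: estimating a Normal mean with a known standard deviation (a simplification of the regression setting, and my own assumption). With a flat prior the posterior of the mean is Normal(sample mean, sigma^2/n), so the central 95% credible interval should match the usual 95% confidence interval up to Monte Carlo error. The data and all variable names are made up for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
sigma = 2.0                                    # assumed known noise sd
y = rng.normal(loc=1.5, scale=sigma, size=50)  # simulated data (illustrative)

n, ybar = y.size, y.mean()
se = sigma / np.sqrt(n)
z = NormalDist().inv_cdf(0.975)

# Frequentist 95% confidence interval for the mean (known sigma).
conf_int = (ybar - z * se, ybar + z * se)

# Bayesian: with a flat prior the posterior of the mean is Normal(ybar, se^2).
# Approximate the central 95% credible interval from posterior draws.
draws = rng.normal(loc=ybar, scale=se, size=200_000)
cred_int = tuple(np.quantile(draws, [0.025, 0.975]))

print(conf_int, cred_int)
```

The two intervals agree numerically, but their interpretations still differ, and the agreement goes away as soon as the prior carries information.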
When you know which priors you want to use for your analysis, it may be possible to compute the PPD analytically, but in many cases this is simply impossible. I'd recommend using a tool like Stan (http://mc-stan.org) to draw samples from the posterior distribution and then use those to determine a credible interval for your parameters and your new (simulated) data.
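As a rough sketch of that sampling workflow (this is not Stan, just NumPy), the example below assumes a model where the posterior happens to be available in closed form: linear regression with a known noise sd `sigma` and independent Normal(0, `tau`^2) priors on the coefficients. Those choices, the simulated data, and all names are my assumptions. The two-step simulation (draw the parameters from the posterior, then draw a new response from the likelihood) is the part that carries over to draws produced by Stan.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustrative): y = 1 + 2x + noise, noise sd assumed known.
sigma, tau = 1.0, 10.0            # noise sd; prior sd of each coefficient
x = rng.uniform(-2, 2, size=40)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma, size=x.size)

# Conjugate posterior of the coefficients: Normal(mu_post, cov_post).
cov_post = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)
mu_post = cov_post @ X.T @ y / sigma**2

# Step 1: draw the parameters from the posterior.
theta = rng.multivariate_normal(mu_post, cov_post, size=20_000)

# Step 2: for each draw, simulate y* from the likelihood at the new predictor.
x_star = np.array([1.0, 1.5])     # intercept term and x* = 1.5
y_star = theta @ x_star + rng.normal(scale=sigma, size=theta.shape[0])

# Central 95% intervals: one for a parameter, one for the new observation.
slope_interval = np.quantile(theta[:, 1], [0.025, 0.975])
pred_interval = np.quantile(y_star, [0.025, 0.975])
print(slope_interval, pred_interval)
```

The predictive interval is much wider than the parameter interval because it includes the observation noise on top of the parameter uncertainty.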
Hope this helps!
First of all, frequentist methods also provide a distribution over possible answers. We just do not call them distributions, for a philosophical reason. Frequentists treat the parameters of a distribution as fixed quantities. They are not allowed to be random; therefore, you cannot talk about distributions over parameters in a meaningful way. In frequentist methods, we estimate confidence intervals, which can be thought of as distributions if we set the philosophical details aside. In Bayesian methods, by contrast, the parameters are allowed to be random; therefore, we talk about the (prior and posterior) distributions over the parameters.
Second, it is not always the case that only a single value is used at the end. Many applications require us to use the entire posterior distribution in subsequent analysis. In fact, to derive a suitable point estimate, the full distribution is required. A well-known example is risk minimization. Another example is model identification in the natural sciences in the presence of significant uncertainties.
Third, Bayesian inference has many benefits over a frequentist analysis (not just the one that you mention):
Ease of interpretation: It is hard to understand what a confidence interval is and why it is not a probability distribution. The reason is simply a philosophical one, as I have briefly explained above. The probability distributions in Bayesian inference are easier to understand because that is how we typically tend to think in uncertain situations.
Ease of implementation: It is easier to get Bayesian probability distributions than frequentist confidence intervals. Frequentist analysis requires us to identify a sampling distribution, which is very difficult for many real-world applications.
Assumptions of the model are explicit in Bayesian inference: For example, many frequentist analyses assume asymptotic Normality for computing the confidence interval. But no such assumptions are required for Bayesian inference. Moreover, the assumptions made in Bayesian inference are more explicit.
Prior information: Most importantly, Bayesian inference allows us to incorporate prior knowledge into the analysis in a relatively simple manner. In frequentist methods, regularization is used to incorporate prior information, which is very difficult to do in many problems. This is not to say that incorporating prior information is easy in Bayesian analysis, but it is easier than in frequentist analysis.
Edit: A particularly good example of the ease of interpretation of Bayesian methods is their use in probabilistic machine learning (ML). Several methods in the ML literature were developed against the backdrop of Bayesian ideas, for example relevance vector machines (RVMs) and Gaussian processes (GPs).
As Richard Hardy pointed out, this answer gives the reasons why someone would want to use Bayesian analysis. There are good reasons to use frequentist analysis as well. In general, frequentist methods are computationally more efficient. I would suggest reading the first 3-4 chapters of 'Statistical Decision Theory and Bayesian Analysis' by James Berger, which gives a balanced view on this issue but with an emphasis on Bayesian practice.
To elaborate on the use of the entire distribution rather than a point estimate to make a decision in risk minimization, a simple example follows. Suppose you have to choose between different parameters of a process to make a decision, and the cost of choosing wrong parameters is $L(\hat{\theta},\theta)$, where $\hat{\theta}$ is the parameter estimate and $\theta$ is assumed to be the true parameter. Given the posterior distribution $p(\theta|D)$ (where $D$ denotes the observations), we can compute the expected loss of an estimate, which is $\int L(\hat{\theta},\theta)p(\theta|D)d\theta$. This expected loss can be evaluated for every candidate value of $\hat{\theta}$, and the $\hat{\theta}$ with minimum expected loss can be used for decision making. This results in a point estimate, but the value of the point estimate depends on the loss function.
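A small numerical sketch of this point: for each candidate estimate, average the loss over posterior draws of the parameter and pick the candidate with the smallest average. The Gamma "posterior", the candidate grid, and the helper `bayes_estimate` are all stand-ins I made up for illustration. The loss-dependence shows up because squared-error loss lands near the posterior mean while absolute-error loss lands near the posterior median.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in posterior draws of the parameter (a skewed Gamma, illustrative).
theta = rng.gamma(shape=2.0, scale=1.0, size=5_000)

# Candidate point estimates to compare.
candidates = np.linspace(0.0, 6.0, 601)

def bayes_estimate(loss):
    """Return the candidate minimizing the posterior expected loss."""
    # Monte Carlo estimate of the expected loss for each candidate.
    expected = np.array([loss(c, theta).mean() for c in candidates])
    return candidates[np.argmin(expected)]

squared = bayes_estimate(lambda c, t: (c - t) ** 2)
absolute = bayes_estimate(lambda c, t: np.abs(c - t))

print(squared, theta.mean())        # squared loss: close to posterior mean
print(absolute, np.median(theta))   # absolute loss: close to posterior median
```

Because the posterior here is skewed, the two losses give visibly different point estimates from the same distribution.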
Based on a comment by Alexis, here is why frequentist confidence intervals are harder to interpret. A confidence interval is (as Alexis has pointed out) a plausible range of estimates for a parameter given a Type I error rate. One naturally asks where this plausible range comes from. The frequentist answer is that it comes from the sampling distribution. But then the question is: how can that be, when we only observe one sample? The frequentist answer is that we infer what other samples could have been observed based on the likelihood function. But if we are inferring other samples based on the likelihood function, those samples should have a probability distribution over them and, consequently, the confidence interval should be interpreted as a probability distribution. For the philosophical reason mentioned above, however, this last extension of a probability distribution to the confidence interval is not allowed. Compare this to a Bayesian statement: a 95% credible region means that the true parameter lies in this region with 95% probability.
A side note on the philosophical differences between Bayesian and frequentist theory (based on a comment): In frequentist theory, the probability of an event is the relative frequency of that event over a large number of repeated trials of the experiment in question. Therefore, the parameters of a distribution are fixed because they stay the same in all the repetitions of the experiment. In Bayesian theory, probabilities are degrees of belief that an event will occur in a single trial of the experiment in question. The problem with the frequentist definition of probability is that it cannot be used to define probabilities in many real-world applications. As an example, try to define the probability that I am typing this answer on an Android smartphone. A frequentist would say that the probability is either $0$ or $1$, while the Bayesian definition allows you to choose an appropriate number between $0$ and $1$.
Best Answer
I might start by saying that a Bayesian model cannot systematically overfit (or underfit) data that are drawn from the prior predictive distribution, which is the basis for a procedure to validate that Bayesian software is working correctly before it is applied to data collected from the world.
But it can overfit a single dataset drawn from the prior predictive distribution or a single dataset collected from the world in the sense that the various predictive measures applied to the data that you conditioned on look better than those same predictive measures applied to future data that are generated by the same process. Chapter 6 of Richard McElreath's Bayesian book is devoted to overfitting.
The severity and frequency of overfitting can be lessened by good priors, particularly those that are informative about the scale of an effect. By putting vanishing prior probability on implausibly large values, you discourage the posterior distribution from getting overly excited by some idiosyncratic aspect of the data that you condition on that may suggest an implausibly large effect.
The best ways of detecting overfitting involve leave-one-out cross-validation, which can be approximated from a posterior distribution that does not actually leave any observations out of the conditioning set. This relies on the assumption that no individual "observation" [*] that you condition on has an overly large effect on the posterior distribution. That assumption is checkable by evaluating the size of the estimate of the shape parameter in a Generalized Pareto distribution that is fit to the importance sampling weights (which are derived from the log-likelihood of an observation evaluated over every draw from the posterior distribution). If this assumption is satisfied, then you can obtain predictive measures for each observation that are as if that observation had been omitted, the posterior had been drawn from conditional on the remaining observations, and the posterior predictive distribution had been constructed for the omitted observation. If your predictions of left-out observations suffer, then your model was overfitting to begin with. These ideas are implemented in the loo package for R, which includes citations to the underlying papers.
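To show where the importance sampling weights come from, here is a deliberately simplified sketch. It uses raw importance sampling only: it does not do the Pareto smoothing or the shape-parameter check described above, both of which the loo package handles, so treat it as an illustration of the idea rather than a substitute. The toy model (Normal data, known sd, flat prior on the mean) and all names are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setup: Normal data with known sd and a flat prior on the mean,
# so posterior draws of mu are available in closed form.
sigma = 1.0
y = rng.normal(loc=0.5, scale=sigma, size=30)
mu = rng.normal(loc=y.mean(), scale=sigma / np.sqrt(y.size), size=4_000)

# log_lik[s, i] = log p(y_i | mu_s): one row per posterior draw.
log_lik = (-0.5 * np.log(2 * np.pi * sigma**2)
           - 0.5 * ((y[None, :] - mu[:, None]) / sigma) ** 2)

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).mean(axis=axis))

# In-sample log predictive density of each observation.
lpd = log_mean_exp(log_lik)

# Importance-sampling LOO: the weight of draw s for observation i is
# 1 / p(y_i | mu_s), which gives p(y_i | y_-i) ~= 1 / mean_s(1 / p(y_i | mu_s)).
elpd_loo = -log_mean_exp(-log_lik)

print(lpd.sum(), elpd_loo.sum())
```

The gap between the in-sample sum and the leave-one-out sum is one rough indication of how much the fit to the conditioning data flatters the model.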
As far as distilling to a single number goes, I like to calculate the proportion of observations that fall within 50% predictive intervals. To the extent that this proportion is greater than one half, the model is overfitting, although you need more than a handful of observations in order to cut through the noise in the inclusion indicator function. For comparing different models (that may overfit), the expected log predictive density (which is calculated by the loo function in the loo package) is a good measure (proposed by I.J. Good) because it takes into account the possibility that a more flexible model may fit the available data better than a less flexible model but is expected to predict future data worse. But these ideas can be applied to the expectation of any predictive measure (that may be more intuitive to practitioners); see the E_loo function in the loo package.
[*] You do have to choose what constitutes an observation in a hierarchical model. For example, are you interested in predicting a new patient or a new time point for an existing patient? You can do it either way, but the former requires that you (re)write the likelihood function to integrate out the patient-specific parameters.
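The single-number check with 50% predictive intervals can be sketched as follows. This is a minimal in-sample version on a toy model (Normal data, known sd, flat prior on the mean); the model, the simulated data, and names such as `y_rep` are assumptions for illustration, and in practice the predictive draws would come from your own fitted model.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup: Normal data, known sd, flat prior on the mean.
sigma = 1.0
y = rng.normal(loc=0.0, scale=sigma, size=200)
mu = rng.normal(loc=y.mean(), scale=sigma / np.sqrt(y.size), size=4_000)

# y_rep[s, i]: posterior predictive draw for observation i under draw s.
y_rep = mu[:, None] + rng.normal(scale=sigma, size=(mu.size, y.size))

# Central 50% predictive interval for each observation.
lower, upper = np.quantile(y_rep, [0.25, 0.75], axis=0)

# Proportion of observed values that fall inside their own interval.
coverage = np.mean((y >= lower) & (y <= upper))
print(coverage)
```

For a model this simple the proportion should sit near one half; following the reasoning above, a value clearly above one half on the data you conditioned on would be the warning sign.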