Why be Bayesian when the model is wrong?

Tags: bayesian, misspecification, modeling, philosophical

Edits: I have added a simple example: inference of the mean of the $X_i$. I have also slightly clarified why the credible intervals not matching confidence intervals is bad.

I, a fairly devout Bayesian, am in the middle of a crisis of faith of sorts.

My problem is the following. Assume that I want to analyse some IID data $X_i$. What I would do is:

  • first, propose a conditional model:
    $$ p(X|\theta) $$

  • Then, choose a prior on $\theta$:
    $$ p(\theta) $$

  • Finally, apply Bayes' rule to compute the posterior $p(\theta | X_1 \dots X_n )$ (or some approximation to it if the exact posterior is intractable) and use it to answer all my questions about $\theta$

This is a sensible approach: if the true model of the data $X_i$ is indeed "inside" of my conditional (it corresponds to some value $\theta_0$), then I can call upon statistical decision theory to say that my method is admissible (see Robert's "The Bayesian choice" for details; "All of statistics" also gives a clear account in the relevant chapter).

However, as everybody knows, assuming that my model is correct is fairly arrogant: why should nature fall neatly inside the box of the models which I have considered? It is much more realistic to assume that the real model of the data $p_{true}(X)$ differs from $p(X|\theta)$ for all values of $\theta$. This is usually called a "misspecified" model.

My problem is that, in this more realistic misspecified case, I don't have any good arguments for being Bayesian (i.e., computing the posterior distribution) rather than simply computing the Maximum Likelihood Estimator (MLE):

$$ \hat \theta_{ML} = \arg \max_\theta [ p(X_1 \dots X_n |\theta) ] $$

Indeed, according to Kleijn and van der Vaart (2012), in the misspecified case, the posterior distribution:

  • converges as $n\rightarrow \infty $ to a Dirac distribution centered at $\hat \theta_{ML} $ (more precisely, at the pseudo-true value $\theta^*$ minimizing the KL divergence from $p_{true}$ to $p(\cdot|\theta)$, which is also the limit of the MLE)

  • does not have the correct asymptotic variance (unless the model-based variance and the "sandwich" variance happen to coincide) to ensure that credible intervals of the posterior match confidence intervals for $\theta$. (Note that, while confidence intervals are obviously something that Bayesians don't care about excessively, this qualitatively means that the posterior distribution is intrinsically wrong, as it implies that its credible intervals do not have correct coverage)

Thus, we are paying a computational premium (Bayesian inference is, in general, more expensive than MLE) for no additional guarantees.

Thus, finally, my question: are there any arguments, whether theoretical or empirical, for using Bayesian inference over the simpler MLE alternative when the model is misspecified?

(Since I know that my questions are often unclear, please let me know if you don't understand something: I'll try to rephrase it)

Edit: let's consider a simple example: inferring the mean of the $X_i$ under a Gaussian model (with known variance $\sigma^2$ to simplify even further).
We place a Gaussian prior on the mean: denote by $\mu_0$ the prior mean and by $\beta_0$ the inverse variance (precision) of the prior. Let $\bar X$ be the empirical mean of the $X_i$. Finally, define $\mu = (\beta_0 \mu_0 + \frac{n}{\sigma^2} \bar X) / (\beta_0 + \frac{n}{\sigma^2} )$.

The posterior distribution is:

$$ p(\theta |X_1 \dots X_n)\; \propto\; \exp\!\Big( -\tfrac{1}{2} \big(\beta_0 + \tfrac{n}{\sigma^2} \big) (\theta - \mu)^2 \Big) $$
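As a sanity check, this conjugate update is easy to compute directly. A minimal sketch (the function and variable names are mine), assuming known variance $\sigma^2$:

```python
import numpy as np

def gaussian_posterior(x, sigma2, mu0, beta0):
    """Posterior for the mean of Gaussian data with known variance sigma2,
    under the prior theta ~ N(mu0, 1/beta0).

    Returns the posterior mean mu_n and posterior precision beta_n.
    """
    n = len(x)
    beta_n = beta0 + n / sigma2                       # posterior precision
    mu_n = (beta0 * mu0 + x.sum() / sigma2) / beta_n  # shrinks x-bar toward mu0
    return mu_n, beta_n

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100)
mu_n, beta_n = gaussian_posterior(x, sigma2=4.0, mu0=0.0, beta0=1.0)
# mu_n is a convex combination of the prior mean (0) and the sample mean;
# beta_n = 1 + 100/4 = 26
```

The posterior mean is a precision-weighted average of $\mu_0$ and $\bar X$, so it always lands between the two.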

In the correctly specified case (when the $X_i$ really have a Gaussian distribution), this posterior has the following nice properties:

  • If the $X_i$ are generated from a hierarchical model in which their shared mean is picked from the prior distribution, then the posterior credible intervals have exact coverage. Conditional on the data, the probability of $\theta$ being in any interval is equal to the probability that the posterior ascribes to this interval

  • Even if the prior isn't correct, the credible intervals have correct coverage in the limit $n\rightarrow \infty$ in which the prior influence on the posterior vanishes

  • The posterior further has good frequentist properties: any Bayesian estimator constructed from the posterior is guaranteed to be admissible, the posterior mean is an efficient estimator (in the Cramér-Rao sense) of the mean, and credible intervals are, asymptotically, confidence intervals.
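The first property above can be checked by simulation. In this hedged sketch (the 95% level and all constants are my own choices), the mean is re-drawn from the prior at every replication, and the central credible interval contains it about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, beta0, sigma2, n, reps = 0.0, 1.0, 1.0, 20, 20000
z = 1.96  # approximate 97.5% standard-normal quantile
hits = 0
for _ in range(reps):
    theta = rng.normal(mu0, 1.0 / np.sqrt(beta0))   # mean drawn from the prior
    x = rng.normal(theta, np.sqrt(sigma2), size=n)  # data from the model
    beta_n = beta0 + n / sigma2
    mu_n = (beta0 * mu0 + x.sum() / sigma2) / beta_n
    hits += abs(theta - mu_n) <= z / np.sqrt(beta_n)  # interval covers theta?
coverage = hits / reps  # close to the nominal 0.95
```

Because everything here is Gaussian, the coverage is exact at every $n$, not just asymptotically.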

In the misspecified case, most of these properties are not guaranteed by the theory. To fix ideas, let's assume that the real model for the $X_i$ is instead a Student-t distribution. The only property that we can guarantee (Kleijn and van der Vaart) is that the posterior distribution concentrates on the real mean of the $X_i$ in the limit $n \rightarrow \infty$. In general, the coverage properties vanish. Worse, in general, we can guarantee that, in that limit, the coverage properties are fundamentally wrong: the posterior distribution ascribes the wrong probability to various regions of parameter space.
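This failure is easy to see numerically. In the following sketch (degrees of freedom, sample size, and the declared variance are my own arbitrary choices, not from the theory above), the data are Student-t with 3 degrees of freedom (true variance 3), while the model insists $\sigma^2 = 1$; the nominal 95% credible interval then covers the true mean far less often than advertised:

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, beta0, sigma2 = 0.0, 1e-6, 1.0  # near-flat prior; model claims variance 1
n, reps, z = 100, 20000, 1.96
hits = 0
for _ in range(reps):
    x = rng.standard_t(df=3, size=n)  # true data: Student-t(3), mean 0, variance 3
    beta_n = beta0 + n / sigma2
    mu_n = (beta0 * mu0 + x.sum() / sigma2) / beta_n
    hits += abs(mu_n - 0.0) <= z / np.sqrt(beta_n)  # does the interval cover 0?
coverage = hits / reps  # well below the nominal 0.95
```

The posterior concentrates on the right mean, but its width reflects the assumed variance 1 rather than the true variance 3, so the credible intervals are systematically too narrow.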

Best Answer

I consider the Bayesian approach when my data set is not everything that is known about the subject, and I want to somehow incorporate that exogenous knowledge into my forecast.

For instance, my client wants a forecast of the loan defaults in their portfolio. They have 100 loans, with a few years of quarterly historical data. There were a few occurrences of delinquency (late payment) and just a couple of defaults. If I try to estimate a survival model on this data set, there will be very little data to estimate from and too much uncertainty to forecast with.

On the other hand, the portfolio managers are experienced people; some of them may have spent decades managing relationships with borrowers. They have ideas about what the default rates should look like. So, they're capable of coming up with reasonable priors. Note: not priors which have nice math properties and look intellectually appealing to me. I'll chat with them and extract their experience and knowledge in the form of those priors.

Now the Bayesian framework provides me with the machinery to marry that exogenous knowledge, in the form of priors, with the data, and to obtain a posterior that is superior to both pure qualitative judgment and a pure data-driven forecast, in my opinion. This is not a philosophy, and I'm not a Bayesian. I'm just using Bayesian tools to consistently incorporate expert knowledge into the data-driven estimation.
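As a hedged illustration of that workflow (the numbers and the Beta prior are entirely made up for this sketch, not the answerer's actual model): suppose the managers' experience says the quarterly default probability is around 2%, encoded as a Beta prior, while the observed history contains only a couple of defaults:

```python
# Hypothetical expert prior on the quarterly default probability:
# managers believe roughly 2%, encoded as Beta(2, 98) (prior mean 0.02).
alpha0, beta0 = 2.0, 98.0

# Sparse observed history: 100 loans * 12 quarters, 2 defaults in total.
trials, defaults = 1200, 2

# Conjugate Beta-Binomial update.
alpha_n = alpha0 + defaults
beta_n = beta0 + trials - defaults

posterior_mean = alpha_n / (alpha_n + beta_n)  # data estimate shrunk toward prior
mle = defaults / trials                        # pure data-driven estimate
```

The posterior mean sits between the raw empirical rate and the experts' prior belief, which is exactly the "marriage" described above: with so few defaults, the expert knowledge keeps the forecast from collapsing onto a noisy point estimate.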