Marginal Likelihood – How to Compute the Marginal Likelihood from MCMC Samples in Bayesian Analysis

Tags: bayesian, likelihood, machine-learning, markov-chain-montecarlo, sampling

This is a recurring question (see this post, this post and this post), but I have a different spin.

Suppose I have a bunch of samples from a generic MCMC sampler. For each sample $\theta$, I know the value of the log likelihood $\log f(\textbf{x} | \theta)$ and of the log prior $\log f(\theta)$. If it helps, I also know the value of the log likelihood per data point, $\log f(x_i | \theta)$ (this information helps with certain methods, such as WAIC and PSIS-LOO).

I want to obtain a (crude) estimate of the marginal likelihood, just with the samples that I have, and possibly a few other function evaluations (but without rerunning an ad hoc MCMC).

First of all, let's clear the table. We all know that the harmonic mean estimator is the worst estimator ever. Let's move on.
If you are doing Gibbs sampling with priors and full conditionals in closed form, you can use Chib's method; but I am not sure how to generalize it outside of those cases. There are also methods that require modifying the sampling procedure (such as via tempered posteriors), but I am not interested in those here.

The approach I am thinking of consists of approximating the underlying distribution with a parametric (or nonparametric) shape $g(\theta)$, and then figuring out the normalization constant $Z$ as a 1-D optimization problem (i.e., the $Z$ that minimizes some error between $Z g(\theta)$ and $f(\textbf{x}|\theta) f(\theta)$, evaluated on the samples). In the simplest case, supposing that the posterior is roughly multivariate normal, I can fit $g(\theta)$ as a multivariate normal and get something similar to a Laplace approximation (I might want to use a few additional function evaluations to refine the position of the mode). However, I could use as $g(\theta)$ a more flexible family, such as a variational mixture of multivariate $t$ distributions.
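For concreteness, here is a minimal sketch (in Python) of the simplest version of what I have in mind, assuming I already have an $M \times d$ array `samples` of MCMC draws and a matching vector `log_joint` of $\log f(\textbf{x}|\theta) + \log f(\theta)$ values; the names are placeholders, and the squared error in log space is just one possible choice of discrepancy.

```python
import numpy as np
from scipy.stats import multivariate_normal


def estimate_log_Z_gaussian(samples, log_joint):
    """Fit a multivariate normal g(theta) to the MCMC draws and choose the
    log of Z that minimizes the squared error between log Z + log g(theta)
    and log f(x|theta) + log f(theta), evaluated at the samples."""
    mu = samples.mean(axis=0)
    Sigma = np.cov(samples, rowvar=False)
    log_g = multivariate_normal.logpdf(samples, mean=mu, cov=Sigma)
    # Least squares in log Z over the samples has this closed-form solution.
    return np.mean(log_joint - log_g)
```

With a mixture of multivariate $t$ distributions in place of the single Gaussian, the same one-dimensional fit for $Z$ applies; a trimmed mean or median of `log_joint - log_g` may behave better when $g$ misses the tails.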

I appreciate that this method works only if $Z g(\theta)$ is a reasonable approximation to $f(\textbf{x}|\theta) f(\theta)$, but is there any reason, or cautionary tale, why it would be very unwise to do this? Any reading that you would recommend?

The fully nonparametric approach uses some nonparametric family, such as a Gaussian process (GP), to approximate $\log f(\textbf{x}|\theta) + \log f(\theta)$ (or some other nonlinear transformation thereof, such as the square root), and Bayesian quadrature to implicitly integrate over the underlying target (see here and here). This seems to be an interesting alternative approach, but analogous in spirit (also, note that GPs would be unwieldy in my case).
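To make the comparison concrete, here is a toy sketch of vanilla Bayesian quadrature applied directly to the unnormalized posterior (the linked papers model its log or square root instead, which behaves much better); the lengthscale, jitter and amplitude heuristics below are placeholders rather than recommendations, and the whole thing is only workable in low dimension.

```python
import numpy as np


def bq_log_Z(samples, log_joint, lengthscale=1.0, jitter=1e-8):
    """Toy Bayesian quadrature: put a zero-mean GP with an isotropic
    squared-exponential kernel on the unnormalized posterior evaluated at
    the MCMC draws, and return the log of the posterior mean of its
    integral over R^d."""
    M, d = samples.shape
    y = np.exp(log_joint - log_joint.max())      # rescale to avoid underflow
    amplitude = y.var() + 1e-12                  # crude signal-variance guess
    sq = np.sum((samples[:, None, :] - samples[None, :, :]) ** 2, axis=-1)
    K = amplitude * np.exp(-0.5 * sq / lengthscale**2) + jitter * np.eye(M)
    # Lebesgue integral of the kernel against one data point: s^2 (2 pi l^2)^{d/2}
    z = amplitude * (2.0 * np.pi * lengthscale**2) ** (d / 2) * np.ones(M)
    Z_scaled = z @ np.linalg.solve(K, y)         # posterior mean of the integral
    if Z_scaled <= 0:                            # a poor GP fit can go negative
        raise ValueError("GP surrogate gives a non-positive integral estimate")
    return np.log(Z_scaled) + log_joint.max()    # undo the rescaling
```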

Best Answer

The Chib and Jeliazkov (2001) extension of Chib's method unfortunately becomes costly or highly variable rather quickly, which is one reason why it is not much used outside Gibbs sampling settings.

While there are many ways and approaches to the normalisation constant $\mathfrak{Z}$ estimation problem (as illustrated by the quite diverse talks at the Estimating Constants workshop we ran last week at the University of Warwick, with slides available there), some solutions do exploit the MCMC output directly.

  1. As you mentioned, the harmonic mean estimator of Newton and Raftery (1994) is almost invariably poor, owing to its infinite variance. However, the infinite variance curse can be avoided by using instead a finite-support target $\alpha$ in the harmonic mean identity $$\int \dfrac{\alpha(\theta)}{\pi(\theta)f(x|\theta)}\text{d}\pi(\theta|x)=\frac{1}{\mathfrak{Z}},$$ for instance by picking $\alpha$ as a (renormalised) indicator of an HPD region for the posterior. This ensures a finite variance by removing the tails from the harmonic mean. (Details are to be found in a paper I wrote with Darren Wraith and in a chapter about normalising constants written with Jean-Michel Marin.) In short, the method recycles the MCMC output $\theta_1,\ldots,\theta_M$ by identifying the fraction $\beta$ (say 20%) of simulations with the largest values of the target $\pi(\theta)f(x|\theta)$ and creating $\alpha$ as a uniform over the union of the balls centred at those highest posterior density (HPD) simulations $\theta^0_i$ with radius $\rho$, meaning the estimate of the normalising constant $\mathfrak{Z}$ is given by $$\hat{\mathfrak{Z}}^{-1}=\underbrace{\frac{1}{\beta M^2}\sum_{m=1}^M}_{\substack{\text{double sum over}\\ \beta M\text{ ball centres }\theta_i^0\\ \text{and }M\text{ simulations }\theta_m}} \underbrace{\mathbb{I}_{(0,\rho)}(\min_i\|\theta_m-\theta^0_i\|)\,\{\pi(\theta_m)f(x|\theta_m)\}^{-1}\big/\overbrace{\pi^{d/2}\rho^d\Gamma(d/2+1)^{-1}}^{\text{volume of ball with radius }\rho}}_{\dfrac{\beta M\,\alpha(\theta_m)}{\pi(\theta_m)f(x|\theta_m)}}$$ where $d$ is the dimension of $\theta$, assuming $\rho$ is small enough that the balls never intersect, so that at most one ball indicator is non-zero (corrections apply for intersecting balls). The explanation for the $\beta M^2$ denominator is that this is a double sum of $\beta M^2$ terms: $$ \frac{1}{\beta M}\sum_{i=1}^{\beta M} \underbrace{\frac{1}{M}\sum_{m=1}^M {\cal U}(\theta_i^0,\rho)(\theta_m)}_{\text{same as with }\min} \times \frac{1}{\pi(\theta_m)f(x|\theta_m)} $$ with each term in $\theta_m$ integrating to ${\mathfrak{Z}}^{-1}$. (A minimal numerical sketch of this estimator follows this list.)

  2. Another approach is to turn the normalising constant $\mathfrak{Z}$ into a parameter. This sounds like a statistical heresy, but the paper by Guttmann and Hyvärinen (2012) convinced me of the opposite. Without getting too much into details, the neat idea therein is to turn the observed log-likelihood $$ \sum_{i=1}^n f(x_i|\theta) - n \log \int \exp f(x|\theta) \,\text{d}x $$ into a joint log-likelihood $$ \sum_{i=1}^n[f(x_i|\theta)+\nu]-n\int\exp[f(x|\theta)+\nu]\,\text{d}x, $$ which is the log-likelihood of a Poisson point process with intensity function $$ \exp\{ f(x|\theta) + \nu +\log n\}. $$ This is an alternative model, in that the original likelihood does not appear as a marginal of the above; only the modes coincide, with the conditional mode in $\nu$ providing the normalising constant. In practice, the Poisson process likelihood above is unavailable, and Guttmann and Hyvärinen (2012) offer an approximation by means of a logistic regression. To connect even better with your question, the resulting estimate is an MLE, hence the solution to a maximisation problem (much like Geyer's approach in the next point).

  3. A connected approach is Charlie Geyer's logistic regression approach. The fundamental notion is to add to the MCMC sample from $\pi(\theta|x)$ another sample from a known density, e.g., your best guess $g(\theta)$ at $\pi(\theta|x)$, and to run a logistic regression on the index of the distribution behind each point (1 for $\pi(\theta|x)$ and 0 for $g(\theta)$), with the regressors being the values of both densities, normalised or not. This happens to be directly linked with Gelman and Meng's (1998) bridge sampling, which also recycles samples from different targets, and with later versions such as Meng's MLE perspective. (A minimal sketch of this logistic-regression estimator is also given after this list.)

  4. A different approach that forces one to run a specific MCMC sampler is Skilling's nested sampling. While I [and others] have some reservations on the efficiency of the method, it is quite popular in astrostatistics and cosmology, with software available like MultiNest, PolyChord and UltraNest.

  5. A last [potential if not always possible] solution is to exploit the Savage-Dickey representation of the Bayes factor in the case of an embedded null hypothesis. If the null writes as $H_0: \theta=\theta_0$ for a parameter of interest, and if $\xi$ is the remaining [nuisance] part of the parameter of the model, then, assuming a prior of the form $\pi_1(\theta)\pi_2(\xi)$, the Bayes factor of $H_0$ against the alternative writes as $$\mathfrak{B}_{01}(x)=\dfrac{\pi^\theta(\theta_0|x)}{\pi_1(\theta_0)},$$ where $\pi^\theta(\theta_0|x)$ denotes the marginal posterior density of $\theta$ at the specific value $\theta_0$. In case the marginal density under the null $H_0: \theta=\theta_0$, $$m_0(x)=\int_\Xi f(x|\theta_0,\xi)\pi_2(\xi)\text{d}\xi,$$ is available in closed form, one can derive the marginal density for the unconstrained model, $$m_a(x)=\int_{\Theta\times\Xi} f(x|\theta,\xi)\pi_1(\theta)\pi_2(\xi)\text{d}\theta\text{d}\xi,$$ from the Bayes factor, since $m_a(x)=m_0(x)/\mathfrak{B}_{01}(x)$. (This Savage-Dickey representation relies on specific versions of three different densities and so is fraught with danger, not even mentioning the computational challenge of producing the marginal posterior; a rough sketch is included after this list.)
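As a numerical companion to point 1, here is a minimal and entirely hypothetical Python sketch of the truncated harmonic mean estimator, assuming the same `samples` ($M \times d$) and `log_joint` arrays as in the question; `beta` and `rho` are tuning choices, and the correction for intersecting balls is ignored.

```python
import numpy as np
from scipy.special import gammaln


def hpd_harmonic_mean_log_Z(samples, log_joint, beta=0.2, rho=None):
    """Harmonic mean identity with alpha = uniform over a union of balls
    centred at the beta*M highest-density draws (point 1 above).
    Ignores the correction needed when the balls intersect."""
    M, d = samples.shape
    n_centres = max(1, int(beta * M))
    centres = samples[np.argsort(log_joint)[-n_centres:]]
    if rho is None:
        # crude placeholder default for the ball radius
        rho = 0.1 * np.median(np.std(samples, axis=0))
    # log volume of one d-ball of radius rho: pi^{d/2} rho^d / Gamma(d/2 + 1)
    log_vol = (d / 2) * np.log(np.pi) + d * np.log(rho) - gammaln(d / 2 + 1)
    # indicator that a draw falls inside at least one ball (via the min distance)
    dists = np.min(np.linalg.norm(samples[:, None, :] - centres[None, :, :], axis=-1), axis=1)
    inside = dists < rho
    # log of each term alpha(theta_m) / [pi(theta_m) f(x|theta_m)], summed stably
    log_terms = -log_joint[inside] - log_vol - np.log(n_centres)
    log_Z_inv = np.logaddexp.reduce(log_terms) - np.log(M)
    return -log_Z_inv
```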
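Points 2 and 3 lead to essentially the same computation; a minimal intercept-only version of the logistic-regression estimator (in the spirit of Geyer's approach and of noise-contrastive estimation, not a definitive implementation) might look as follows, where the argument names are placeholders for the log target and the log of the known density $g$ evaluated at each of the two samples.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def geyer_log_Z(log_joint_post, log_g_post, log_joint_g, log_g_g):
    """Intercept-only logistic regression for log Z: the MCMC draws get
    label 1, an i.i.d. sample from the known density g gets label 0, and
    the log density ratio enters as a fixed offset."""
    offset = np.concatenate([log_joint_post - log_g_post,
                             log_joint_g - log_g_g])
    labels = np.concatenate([np.ones_like(log_joint_post),
                             np.zeros_like(log_joint_g)])
    n1, n0 = len(log_joint_post), len(log_joint_g)

    def neg_log_lik(c):
        eta = offset + c
        return -(np.sum(labels * eta) - np.sum(np.logaddexp(0.0, eta)))

    c_hat = minimize_scalar(neg_log_lik).x
    # fitted intercept estimates log(n1/n0) - log Z
    return np.log(n1 / n0) - c_hat
```

In practice $g$ can be the Gaussian (or mixture) fit mentioned in the question, so that drawing from it and evaluating its density are both cheap.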
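Finally, for point 5, a rough sketch of the Savage-Dickey route for a scalar $\theta$: estimate the marginal posterior density at $\theta_0$ from the MCMC draws (here with a kernel density estimate, precisely the kind of step the caveat above warns about), divide by the prior density, and convert $m_0(x)$ into $m_a(x)$. The standard normal prior $\pi_1$ below is only a placeholder.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm


def savage_dickey_log_ma(theta_draws, theta0, log_m0):
    """Savage-Dickey: B01 = posterior density / prior density at theta0,
    and log m_a(x) = log m_0(x) - log B01(x).
    theta_draws: MCMC draws of the scalar parameter of interest under the
    unconstrained model; log_m0: closed-form log marginal under H0."""
    post_at_theta0 = gaussian_kde(theta_draws)(theta0)[0]   # KDE estimate of pi(theta0 | x)
    prior_at_theta0 = norm.pdf(theta0)                      # placeholder prior pi_1 = N(0, 1)
    log_B01 = np.log(post_at_theta0) - np.log(prior_at_theta0)
    return log_m0 - log_B01
```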

[Here is a set of slides I wrote about estimating normalising constants for a NIPS workshop last December.]
