Metropolis-Hastings Algorithm – Using the Log of the Density in MCMC

Tags: gibbs, markov-chain-montecarlo, metropolis-hastings, rejection-sampling

Does Metropolis-Hastings work with the log of the proposal and the density to be sampled from?
That is, say we want to sample from a density $\pi(x)$, using a proposal $q(x|x^{old})$, will the Metropolis-Hastings work with $\log(\pi(x))$ and $\log(q(x|x^{old}))$ as well?

When constructing a Gibbs sampler, we often encounter full conditional distributions that are non-conjugate. Techniques to sample from them include adaptive rejection sampling (ARS) [1, 2], adaptive rejection Metropolis sampling (ARMS) [3], and slice sampling [4]. These techniques have the convenient feature that they can take the log of the density to be sampled from. They are technically Metropolis-Hastings samplers, so in those cases my question is answered in the affirmative. But is this a general feature of the Metropolis-Hastings algorithm?

The reason for my question is that you often store the log of the density when designing a Gibbs sampler, especially when using an object-oriented language and one of the three samplers mentioned above. If you have $\log( \pi(x) )$ stored, you could compute $e^{ \log(\pi(x)) } = \pi(x)$ and run Metropolis-Hastings on that, but exponentiating can cause numerical overflow and underflow.

I have been unable to find a reference that addresses this question or provides a proof.

EDIT:

I didn't mean to ask whether I can sample from $\log(\pi(x))$, since that makes no sense. My question was whether the Metropolis-Hastings algorithm can work if I pass it $\log(\pi(x))$. That is, is there a way to construct the algorithm that uses only the better-behaved $\log(\pi(x))$ rather than $\pi(x)$? The latter is usually a product of several other densities, so it can become extremely large or small very quickly. Sums of log-densities are much easier to work with in algorithms than products of densities.
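To make the numerical point concrete, here is a small sketch (with made-up standard-normal data; the data and sample size are illustrative assumptions, not from the original post) showing that a product of many densities underflows to zero in double precision while the corresponding sum of log-densities remains finite:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2000)  # hypothetical data, for illustration only

# Product of 2000 standard-normal density values underflows to 0.0
# because it falls below the smallest representable IEEE double.
densities = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
product = np.prod(densities)

# The equivalent sum of log-densities is perfectly well behaved.
log_density_sum = np.sum(-0.5 * x**2 - 0.5 * np.log(2 * np.pi))

print(product)          # 0.0 (underflow)
print(log_density_sum)  # a finite, large negative number
```

This is exactly why samplers are usually written in terms of the log-density from the start.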


References:

[1] Gilks, W. R., Wild, P.: Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41(2), 337–348, 1992.

[2] Gilks, W. R.: Derivative-free adaptive rejection sampling for Gibbs sampling. In: Bernardo, J. M., Berger, J. O., Dawid, A. P., Smith, A. F. M. (eds.) Bayesian Statistics 4, Oxford University Press, Oxford, 641–649, 1992.

[3] Gilks, W. R., Best, N. G., Tan, K. K. C.: Adaptive rejection Metropolis sampling within Gibbs sampling. Applied Statistics 44(4), 455–472, 1995.

[4] Neal, R. M.: Slice sampling. Annals of Statistics 31(3), 705–741, 2003.

Best Answer

As hinted at by @Tim, the solution is quite simple. A function implementing Metropolis-Hastings can take $\log(\pi(x))$ and $\log(q(x|x^{old}))$, provided everything then happens on the log scale. Let $\alpha$ be the acceptance probability of the Metropolis-Hastings update and $x'$ be the current value of the sampler. Then we propose a new $x$ and accept it with probability:

$$ \alpha(x | x') = \min\left(1 , \frac{\pi(x) q(x'|x) }{\pi(x') q(x|x') } \right) $$ or in log terms $$ \log\big( \alpha(x | x') \big) = \min\Big( 0 , \log(\pi(x)) + \log(q(x'|x)) - \log(\pi(x')) - \log( q(x|x') ) \Big) \text{.} $$

Then $\log(\alpha)$ can be used to accept or reject the proposed $x$: draw $u \sim \mathrm{Uniform}(0,1)$ and accept whenever $\log(u) < \log(\alpha)$, which is equivalent to accepting with probability $\alpha$.

This is a useful formulation of the Metropolis-Hastings algorithm with practical benefits. It is used in this textbook: http://mcmcinirt.stat.cmu.edu/archives/320

That is, the implementation of Metropolis-Hastings takes the log of the density to be sampled from and the log of the proposal. Working with logs is convenient because log-densities remain within the range of IEEE doubles even when the densities themselves would overflow or underflow, as pointed out by @whuber.