Can MCMC sample any probability distribution?

markov-chain-montecarlo, markov-process, measure-theory, mixed-type-data, stochastic-processes

I have three fundamental questions related to MCMC. I would appreciate help with any one of them.

  1. The most fundamental question in the MCMC field, for which I can't find a reference, is: can MCMC generate any probability distribution? Put another way: is every probability distribution the limiting distribution of some Markov chain? Although I have seen claims that "MCMC can be used to sample from any probability distribution", I find this suspicious.
  2. There is another question. In *Statistical and Computational Inverse Problems* by Jari Kaipio and Erkki Somersalo, when introducing MCMC, they assume the kernel has a certain "inertia":

Let $P$ be any transition kernel. Given a point $x \in \mathbb{R}^n$, the kernel either moves from $x$ to another point $y \in \mathbb{R}^n$ or it proposes no move away from $x$. This allows us to split the kernel into two parts:
$$
P(x,B)=\int_BK(x,y)dy+r(x)\chi_B(x)
$$

where $B$ is a Borel set and $\chi_B$ is the characteristic function. Later, when introducing Metropolis-Hastings, they use such a kernel (with inertia).

The motivation behind this confuses me. I wonder whether there is any distribution that must be generated using a transition kernel with inertia. For continuous distributions, my intuition tells me that choosing such a kernel will only slow down the convergence of Metropolis-Hastings.

  3. The third question may be related to the 2nd, but I'm not sure. The question is: how do we generate mixed distributions? For those mixed distributions whose total mass of the discrete part is computable, I guess I can use more than one independent Markov chain. Say the weights for the mixture between the discrete and continuous parts are 0.4 and 0.6; then the machine has probability 0.4 of going to the discrete chain and 0.6 of going to the continuous one. However, how do we generate samples when the total probability mass of the discrete part is not computable? Would Markov chains with inertia help us?

Best Answer

1. Can MCMC generate any probability distribution

Imho, this is an ill-posed question in that it depends on how the (target) probability distribution is defined in the original problem. If this distribution is not provided through its density (up to a normalising constant), or as the marginal of another distribution with available density, MCMC may prove too challenging to be realistically implemented.

For instance, Approximate Bayesian computation (ABC) methods were introduced to handle well-defined probability distributions whose density was unavailable, even through a manageable completion by latent variables. A (toy) illustration is provided by the $g$-and-$k$ distributions, which are often used as a benchmark in ABC research due to the unavailability of the density function attached to the quantile function $$q(r) = a + b\left(1 + 0.8\,\dfrac{1 - \exp(-g\zeta(r))}{1 + \exp(-g\zeta(r))}\right) \left(1 + \zeta(r)^2\right)^\kappa\, \zeta(r),$$ where $\zeta(r)$ refers to the $r$-th quantile of the standard Normal distribution (even though an MCMC approach proves feasible up to numerical precision error). This family of $g$-and-$k$ distributions is however simulable, since the quantile function is in closed form. Still, many other distributions may be (perfectly) defined via their moment generating function, the Fourier transform of their density, a fixed point equation, or in any other indirect way that makes a standard MCMC method like Metropolis-Hastings sampling too complex or time-consuming to implement.
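Since the quantile function is in closed form, simulation reduces to inverse-CDF sampling: feed Uniform(0,1) draws through $q(\cdot)$. A minimal sketch (the parameter values $a=0$, $b=1$, $g=2$, $\kappa=0.5$ are arbitrary illustration choices, not values from the text):

```python
import math
import random
from statistics import NormalDist

def gk_quantile(r, a=0.0, b=1.0, g=2.0, kappa=0.5):
    """Quantile function of the g-and-k distribution; zeta(r) is the
    r-th quantile of the standard Normal.  Parameters are illustrative."""
    z = NormalDist().inv_cdf(r)
    e = math.exp(-g * z)
    return a + b * (1 + 0.8 * (1 - e) / (1 + e)) * (1 + z * z) ** kappa * z

# Inverse-CDF sampling: q(U) with U ~ Uniform(0,1) has the g-and-k law.
rng = random.Random(0)
samples = [gk_quantile(rng.random()) for _ in range(10_000)]
```

Note that no density evaluation is ever needed: the sample median sits near $q(0.5) = a$, by construction of the quantile transform.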

2. any distribution must be generated using a transition kernel with inertia

In the representation, $$P(x,B)=\int_B K(x,y)\,\text dy+r(x)\chi_B(x)$$ $\chi_B(\cdot)$ denotes the indicator function, not the characteristic function. This representation allows for the existence of an atom in $x$ for the transition kernel, but since $r(\cdot)\equiv 0$ is a possible choice of function, this is not overly restrictive (even though it excludes other atoms unless the measure $\text dy$ is also mixed) and applies, for instance, to a Gibbs sampler (which has no atom in continuous settings).

2'. any distribution must be generated using a transition kernel with inertia

The point mass at $x$ is necessary to validate the Metropolis-Hastings algorithm by imposing a rejection step correcting for the initial kernel $K(x,\cdot)$ missing the target (as stationary distribution). There exist other MCMC algorithms avoiding rejection but they are not necessarily more efficient. (The ultimate counter-example is the random walk Markov chain that moves at each iteration but never converges.)
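The rejection step is easy to see in code: whenever a proposal is rejected, the chain stays at $x$, which is precisely the $r(x)\chi_B(x)$ point-mass part of the kernel. A minimal random-walk Metropolis sketch on a standard Normal target (target and step size are illustrative choices):

```python
import math
import random

def rw_metropolis(log_target, x0, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis sampler.  A rejection leaves the chain at
    x, realizing the inertia term r(x)*chi_B(x) of the transition kernel."""
    rng = random.Random(seed)
    x, chain, stays = x0, [], 0
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)                # proposal K(x, .)
        log_alpha = log_target(y) - log_target(x)   # symmetric proposal
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y                                   # accept the move
        else:
            stays += 1                              # reject: stay at x
        chain.append(x)
    return chain, stays

# Standard Normal target, known only up to its normalising constant.
chain, stays = rw_metropolis(lambda x: -0.5 * x * x, x0=0.0, n_iter=20_000)
```

The count `stays` is strictly positive in any nontrivial run: the point mass at $x$ is not an option but a structural feature of the Metropolis-Hastings kernel.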

3. How do we generate mixed distributions

As a generic answer, a mixed distribution (if understood in a restrictive sense as any distribution with a Lebesgue continuous part and a finite or countable collection of positive probability atoms) is a standard distribution with a density wrt a dominating measure that is, e.g., the sum of the Lebesgue measure $\lambda$ and of the counting measure over the collection of atoms. Thus, any proposal kernel $K(x,\cdot)$ that is dominated by this very same measure is a valid choice for running an MCMC algorithm such as Metropolis-Hastings in such circumstances. In practice, this means that the proposal $K(x,\cdot)$ must give positive probability (for some values of the current value $x$ of the Markov chain) to the atoms of the target. (For, otherwise, the target cannot be the limiting distribution.)
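A sketch of this on a toy mixed target, $0.4\,\delta_0 + 0.6\,\mathcal N(2,1)$, with dominating measure $\lambda + \delta_0$ (the target, the atom location, and the proposal's atom probability are all illustrative assumptions): both target and proposal densities are written wrt the same dominating measure, and the usual Metropolis-Hastings ratio applies unchanged.

```python
import math
import random

ATOM, W_ATOM = 0.0, 0.4      # toy target: 0.4*delta_0 + 0.6*N(2,1)
P_ATOM = 0.3                 # proposal's probability of proposing the atom

def target_density(x):
    """Target density wrt (Lebesgue + counting measure at the atom)."""
    if x == ATOM:
        return W_ATOM
    return 0.6 * math.exp(-0.5 * (x - 2.0) ** 2) / math.sqrt(2 * math.pi)

def proposal_density(x, y):
    """Proposal q(x, y) wrt the same dominating measure."""
    if y == ATOM:
        return P_ATOM
    return (1 - P_ATOM) * math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

def sample_mixed(n_iter, seed=0):
    rng = random.Random(seed)
    x, chain = 1.0, []
    for _ in range(n_iter):
        # Proposal gives positive probability to the atom, as required.
        y = ATOM if rng.random() < P_ATOM else x + rng.gauss(0.0, 1.0)
        ratio = (target_density(y) * proposal_density(y, x)
                 / (target_density(x) * proposal_density(x, y)))
        if rng.random() < ratio:
            x = y
        chain.append(x)
    return chain

chain = sample_mixed(50_000)
```

In the long run the chain spends a fraction about 0.4 of its time at the atom and the remainder on draws whose average is near 2, without the discrete mass ever entering the algorithm as a separate normalising step.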

Obviously, if the discrete part of the target distribution can be directly simulated and if furthermore the discrete part mass of the target distribution is known, simulation of a sample from the target can be split between simulation of a (sub)sample from the continuous part, possibly via MCMC, and simulation of a (sub)sample from the discrete part, possibly without a call to MCMC. But this is not necessary for the MCMC algorithm to remain valid.
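When the discrete mass is known, this split is plain composition sampling: flip a coin with the known mass, then draw from the chosen component. A sketch with an illustrative known mass of 0.4, atoms at $\pm 1$, and a $\mathcal N(2,1)$ continuous part (all toy choices, not from the text):

```python
import random

def sample_split(n, w_discrete=0.4, seed=0):
    """Composition sampling when the discrete mass w_discrete is known:
    choose the component with the known weight, then draw from it directly.
    (The continuous draw could equally be produced by an MCMC chain.)"""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < w_discrete:
            out.append(rng.choice([-1.0, 1.0]))   # discrete part
        else:
            out.append(rng.gauss(2.0, 1.0))       # continuous part
    return out

draws = sample_split(10_000)
```

The coin flip is exactly where the known mass enters; when that mass is unknown, this construction is unavailable and one falls back on the single-chain MCMC approach above.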

Furthermore, when the discrete part mass is unknown, an MCMC approach (with the proper atoms) offers the advantage of producing an exact simulation (asymptotically) since it does not require the normalization constant of the target to be known.
