Why are some sampling algorithms better than others?

sampling

How exactly do we show that some sampling algorithms perform better than others in certain types of situations?

In my computational statistics seminar, we covered the following methods for simulating random data from probability distributions:

  • Acceptance-Rejection Method: Can simulate data from any distribution, and does not require inverting a CDF. It uses the ratio of the target distribution to some candidate (proposal) distribution to sample from the target distribution. However, it is said to be ineffective in some situations (e.g. many rejections, not fully exploring the sample space), and I don't understand why. The steps are (see the first sketch after this list):
  1. Select a value of $c$ such that for all $x$, the following is true: $c \cdot g(x) \geq f(x)$ (here $g(x)$ is the proposal distribution and $f(x)$ is the target distribution)
  2. Generate a proposal $x$ from a known distribution $g(x)$.
  3. Generate a random number $u$ from a uniform distribution on $[0, 1]$.
  4. Accept the proposal $x$ if $u \leq \frac{f(x)}{c \cdot g(x)}$, where $c$ is the constant chosen in step 1. If the proposal is not accepted, go back to step 2.
  • Metropolis-Hastings (MCMC): Can also simulate data from any distribution, and is supposedly better than the Acceptance-Rejection Method in some situations, but I don't know why. The steps are (see the second sketch after this list):
  1. Start with an initial point $x_0$.
  2. For each step, generate a proposal $x'$ from a distribution $Q(x'|x)$, which depends on the current position $x$.
  3. Compute the acceptance ratio $r = \frac{f(x')/Q(x'|x)}{f(x)/Q(x|x')}$, where $f(x)$ is the target distribution.
  4. Accept the proposal $x'$ with probability $\min(1, r)$. If the proposal is not accepted, stay at the current position $x$.
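
To make the Acceptance-Rejection steps concrete, here is a minimal Python sketch of the four steps above. It is not from the original post: the target $f$ (a Beta(2, 5) density), the Uniform(0, 1) proposal $g$, and the crude grid-based choice of $c$ are all my own illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta, uniform

# Illustrative choices (not from the post): target f is a Beta(2, 5) density
# on [0, 1]; proposal g is Uniform(0, 1).
f = beta(2, 5).pdf          # target density f(x)
g = uniform(0, 1).pdf       # proposal density g(x)

# Step 1: pick c with c * g(x) >= f(x) for all x (crude bound via a grid).
grid = np.linspace(0, 1, 10_001)
c = (f(grid) / g(grid)).max()

def acceptance_rejection(n, rng=np.random.default_rng(0)):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0, 1)          # Step 2: propose x from g
        u = rng.uniform(0, 1)          # Step 3: u ~ Uniform[0, 1]
        if u <= f(x) / (c * g(x)):     # Step 4: accept with prob f(x) / (c g(x))
            samples.append(x)
    return np.array(samples)

draws = acceptance_rejection(10_000)
# Expected acceptance rate is 1/c; the Beta(2, 5) mean is 2/7.
print(c, draws.mean())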
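And here is a matching sketch of the Metropolis-Hastings steps, again my own illustration rather than anything from the post. It uses a Gaussian random-walk proposal $Q(x'|x) = \mathcal{N}(x, \text{step}^2)$, which is symmetric, so the $Q$ terms cancel from the ratio; the target is the same Beta(2, 5) density, deliberately passed only up to a constant to show that the normaliser is never needed.

```python
import numpy as np

# Unnormalised target: Beta(2, 5) up to its normalising constant.
# Passing only the unnormalised form is deliberate -- MH never needs it.
def f_unnorm(x):
    return x * (1 - x) ** 4 if 0 < x < 1 else 0.0

def metropolis_hastings(n, x0=0.5, step=0.2, rng=np.random.default_rng(0)):
    chain = np.empty(n)
    x = x0                                 # Step 1: initial point
    for i in range(n):
        x_prop = rng.normal(x, step)       # Step 2: proposal from Q(x'|x)
        # Step 3: acceptance ratio; the symmetric Q and any constant in f cancel
        r = f_unnorm(x_prop) / f_unnorm(x)
        # Step 4: accept with probability min(1, r), otherwise stay at x
        if rng.uniform() < r:
            x = x_prop
        chain[i] = x
    return chain

chain = metropolis_hastings(20_000)
# After discarding a burn-in, the mean should be near the Beta(2, 5) mean 2/7.
print(chain[2_000:].mean())
```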

Is there any mathematical intuition (e.g. some toy problem) that demonstrates why, for certain distributions, Metropolis-Hastings will be more effective (e.g. fewer rejections, getting stuck less often) than the Acceptance-Rejection Method?

I am looking for something like the examples used when teaching optimization: e.g. a function with many saddle points and local minima, where a first-order method falsely concludes it has found the optimum (due to its stopping condition) and stops, whereas a second-order method can recognise that this is not the optimum and keep moving.

Re: Thomas's point on why Acceptance-Rejection sampling becomes difficult in Bayesian situations where the denominator of the posterior distribution is not known (due to an intractable integral), here is my guess:

For a posterior distribution $f(\theta | x)$, likelihood function $L(\theta | x)$ and prior distribution $h(\theta)$, Bayes' law gives:

$$f(\theta | x) = \frac{L(\theta | x) \cdot h(\theta)}{\int L(\theta | x) \cdot h(\theta) d\theta}$$

When it comes to determining the constant $c$, guaranteeing that the inequality below holds for all $x$ becomes impossible without fully knowing $f(x)$ (i.e. it is impossible without knowing the denominator of $f(x)$):

$$f(x) \leq c \cdot g(x)$$
$$c \geq \frac{f(x)}{g(x)}$$

Thus, the Acceptance-Rejection sampling method becomes impossible in this case.

Best Answer

Both rejection sampling and Metropolis-Hastings are really classes of algorithms rather than single algorithms. However, one key difference between rejection sampling and Metropolis-Hastings algorithms is that rejection sampling requires you to know the density (or probability mass function), whereas Metropolis-Hastings can sample from densities that you only know up to a normalising constant.

That is, if you know the density is $f(x)=cg(x)$ for some constant $c$, you can't do rejection sampling without knowing $c$, because $c$ is part of the acceptance probability. You can do Metropolis-Hastings sampling because $c$ cancels out of the Metropolis acceptance ratio.
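To make the cancellation explicit (this derivation is not part of the original answer; $\tilde f$ and $Z$ are my notation for the unnormalised density and the unknown normalising constant), write $f(x) = \tilde f(x)/Z$. The acceptance ratio from the question, written in product form, becomes

$$r = \frac{f(x')\, Q(x | x')}{f(x)\, Q(x' | x)} = \frac{\big(\tilde f(x')/Z\big)\, Q(x | x')}{\big(\tilde f(x)/Z\big)\, Q(x' | x)} = \frac{\tilde f(x')\, Q(x | x')}{\tilde f(x)\, Q(x' | x)},$$

so the unknown $Z$ never has to be evaluated. In the Bayesian setting, $\tilde f(\theta) = L(\theta | x) \cdot h(\theta)$ is exactly the numerator of Bayes' law above.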

In Bayesian inference, it's typical for the posterior density to be given only up to an unknown normalising constant (the denominator of Bayes' Rule) so a class of sampling algorithms that works in that case is very valuable.

In the other direction, rejection sampling (if you can do it) can be preferable because it produces independent samples, whereas MCMC produces dependent ones.
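
One quick way to see that dependence, not taken from the answer: compare the lag-1 autocorrelation of a random-walk Metropolis chain targeting a standard normal with that of i.i.d. normal draws (a stand-in for what rejection sampling would give you for the same target). The step size and chain length below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_autocorr(x):
    # Simple lag-1 sample autocorrelation
    x = x - x.mean()
    return (x[:-1] * x[1:]).sum() / (x * x).sum()

def rw_metropolis(n, step=0.5):
    # Random-walk Metropolis chain targeting a standard normal (symmetric proposal)
    chain = np.empty(n)
    x = 0.0
    for i in range(n):
        x_prop = x + rng.normal(0, step)
        # log-density ratio of N(0, 1); proposal terms cancel by symmetry
        if np.log(rng.uniform()) < 0.5 * (x**2 - x_prop**2):
            x = x_prop
        chain[i] = x
    return chain

chain = rw_metropolis(50_000)
iid = rng.normal(size=50_000)   # stand-in for an exact / rejection sampler

print("MH lag-1 autocorrelation:    ", round(lag1_autocorr(chain), 3))  # well above 0
print("i.i.d. lag-1 autocorrelation:", round(lag1_autocorr(iid), 3))    # near 0
```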

You can also combine the approaches; for example, BUGS uses adaptive rejection sampling for log-concave full conditional densities.
