Solved – Metropolis-Hastings Algorithm – Prior vs Proposal vs Numerator of Bayes' Theorem

bayesian, markov-chain-montecarlo, metropolis-hastings, posterior, prior

I've been using this technique in 'black-box' form for a little while as a physics student.

I have been struggling to understand what's happening under the hood for some time, and I think I almost have it – but I'm mixed up about several things. I'm trying to fit together the prior distribution, the proposal distribution, and Bayes' Theorem.

What is the difference between the prior $\mathcal{P}$ and the jumping distribution $\mathcal{Q}$ in the simple MH algorithm? Are these the same thing? I know the jumping distribution is used to draw a candidate point $X_c$ that is 'near' the current point $X_i$. Often it is something like a Gaussian centered at $X_i$, producing an $X_c$ that is usually locally 'close' to $X_i$. That's what I keep finding in resources such as this YouTube video.

Elsewhere, like in this amazing tutorial, I see that the prior is a surface over the parameter space in question, shaped in a way that makes sense given the problem, and this surface is used to weight the posterior in conjunction with the appropriate likelihood function. For instance, maybe we believe in advance that scores on a particular exam are Gaussian distributed with known mean. So the posterior is weighted by this assumption, and the MCMC will return samples with a higher density in that region for that particular parameter.

That makes sense with Bayes' Theorem, as the posterior is proportional to the product of the prior and the likelihood:

$P(A|B) \propto P(B|A)P(A)$

I am not seeing where this happens in the MH algorithm. I know these three things (prior, proposal, and Bayes' Theorem) are all wrapped up in the 'acceptance ratio' part of the algorithm. Can someone break it down for me like I'm five years old and show me exactly where and how those pieces fit together?

Finally, I'm having a lot of trouble understanding how we define the likelihood for an arbitrary statistical model. One explanation from the first video is that it's an exponential function that literally evaluates the difference between the proposed fit and the real data:

$\mathcal{L} = e^{-(\text{Data} - \text{Model})}$

Okay, so this function takes higher values in regions of parameter space that produce better fits. But how can we say this is exactly the actual likelihood (i.e. the $P(B|A)$ part of Bayes' Theorem)? It doesn't seem guaranteed to be proportional to what we want at every point…

Any help, guidance, or resources greatly appreciated! Thanks so much for reading to this point and bearing with me.

Best Answer

As you say, the three elements used in MH are the proposal (jumping) distribution, the prior distribution, and the likelihood. Say we want to estimate the posterior distribution of a parameter $\Theta$ after observing some data $\mathbf x$, that is, $p(\Theta|\mathbf{x})$. Assume we know the prior distribution $p(\Theta)$, which summarizes our beliefs about the value of $\Theta$ before we observe any data.

Now, it is usually impossible to compute the posterior distribution analytically. Instead, an ingenious method is to construct an abstract Markov chain whose states are values of $\Theta$, such that the stationary distribution of the chain is the desired posterior. Metropolis-Hastings (MH) is one schema for constructing such a chain (not the only one; there's also Gibbs sampling, for example), and it requires carefully selecting a jumping (or proposal) distribution $q(\Theta|\theta)$. To go from one value of $\Theta$, denoted $\theta$, to the next, say $\theta'$, we apply the following procedure:

  1. Draw a candidate (or proposed) value $\theta^*$ from $q(\Theta|\theta)$, where $\theta$ is the current value.
  2. Accept the candidate value with probability given by the MH acceptance ratio: $$ \alpha(\theta,\theta^*) = \min\left[1,\frac{p(\theta^*|\mathbf{x})\;q(\theta|\theta^*)}{p(\theta|\mathbf{x})\;q(\theta^*|\theta)} \right]. $$ Applying Bayes' rule to the posterior terms (the evidence $p(\mathbf{x})$ appears in both numerator and denominator, so it cancels), we get: $$ \alpha(\theta,\theta^*) = \min\left[1,\frac{p(\theta^*)\;p(\mathbf{x}|\theta^*)\;q(\theta|\theta^*)}{p(\theta)\;p(\mathbf{x}|\theta)\;q(\theta^*|\theta)} \right]. $$ If the candidate is rejected, the chain stays at $\theta$.

After iterating this process "enough" times, we are left with a collection of points that approximates the posterior distribution.
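
To make the procedure concrete, here is a minimal sketch in Python. The toy model (Gaussian data with unknown mean and known unit variance), the $N(0, 10^2)$ prior, and the jump width $0.5$ are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (assumed for illustration): infer the mean theta of Gaussian
# data with known unit variance, under a N(0, 10^2) prior on theta.
x = rng.normal(loc=2.0, scale=1.0, size=50)  # observed data

def log_prior(theta):
    return -0.5 * theta**2 / 10.0**2         # log p(theta), up to a constant

def log_likelihood(theta):
    return -0.5 * np.sum((x - theta)**2)     # log p(x|theta), up to a constant

theta = 0.0                                  # arbitrary starting state
samples = []
for _ in range(10_000):
    theta_star = rng.normal(theta, 0.5)      # step 1: draw from q(.|theta)
    # Step 2: the Gaussian jump is symmetric, q(theta|theta*) = q(theta*|theta),
    # so those terms cancel and the log acceptance ratio reduces to
    # log[p(theta*) p(x|theta*)] - log[p(theta) p(x|theta)].
    log_alpha = (log_prior(theta_star) + log_likelihood(theta_star)
                 - log_prior(theta) - log_likelihood(theta))
    if np.log(rng.uniform()) < log_alpha:    # accept with probability min(1, alpha)
        theta = theta_star
    samples.append(theta)                    # a rejection repeats the old value

print(np.mean(samples[2000:]))               # close to the sample mean of x
```

Note where each piece lives: the prior and the likelihood enter only through the acceptance ratio, while the proposal only decides which candidate gets evaluated. Working with log densities avoids numerical underflow when the likelihood is a product over many data points.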

A counterintuitive thing about the formula above is that the proposal probability of the candidate value appears in the denominator, while the "reverse" proposal probability (i.e. of going from the proposed value back to the original one) appears in the numerator. This arrangement makes the overall transition distribution satisfy a condition on the Markov chain called detailed balance, which guarantees that the desired posterior is the chain's stationary distribution. I found this paper quite helpful on this topic.
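
To see why this works, note that $\pi_i q_{ij}\,\min\!\left(1, \frac{\pi_j q_{ji}}{\pi_i q_{ij}}\right) = \min(\pi_i q_{ij},\, \pi_j q_{ji})$, which is symmetric in $i$ and $j$ – exactly the detailed balance condition. Here is a toy numerical check (the three-state target and proposal matrix are arbitrary choices of mine, just to demonstrate the property):

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])              # arbitrary target distribution
q = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])             # q[i, j]: probability of proposing j from i

# Build the MH transition matrix: T[i, j] = q[i, j] * alpha(i, j) for j != i.
T = np.zeros_like(q)
for i in range(3):
    for j in range(3):
        if i != j:
            alpha = min(1.0, pi[j] * q[j, i] / (pi[i] * q[i, j]))
            T[i, j] = q[i, j] * alpha
    T[i, i] = 1.0 - T[i].sum()              # rejected proposals keep the chain in place

flow = pi[:, None] * T                      # flow[i, j] = pi_i * T[i, j]
assert np.allclose(flow, flow.T)            # detailed balance: pi_i T_ij == pi_j T_ji
print(pi @ np.linalg.matrix_power(T, 100))  # converges back to pi
```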

Now, it is perfectly possible to use the prior distribution itself as the proposal distribution: $q(\Theta|\theta)=p(\Theta)$. Note that in this case the proposal does not depend on the current value of $\Theta$, but that is not a problem in theory. If we substitute this into the formula for $\alpha$ above, the prior terms cancel against the proposal terms and we obtain: $$ \alpha(\theta,\theta^*) = \min\left[1,\frac{p(\mathbf{x}|\theta^*)}{p(\mathbf{x}|\theta)} \right]. $$ What is left is just the ratio of the likelihoods. This is a very simple approach, and usually not a very efficient one, but it may work for simple problems.
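
Here is a sketch of that special case, reusing the same toy Gaussian-mean problem assumed above; the accept step now needs only the likelihoods:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # same toy data as before

def log_likelihood(theta):
    return -0.5 * np.sum((x - theta)**2)      # log p(x|theta), up to a constant

theta, samples = 0.0, []
for _ in range(50_000):
    theta_star = rng.normal(0.0, 10.0)        # q(.|theta) = p(.): draw from the prior
    # Prior and proposal terms cancel; only the likelihood ratio remains.
    if np.log(rng.uniform()) < log_likelihood(theta_star) - log_likelihood(theta):
        theta = theta_star
    samples.append(theta)
```

Because the wide prior rarely proposes values near where the posterior mass sits, most candidates are rejected – which is exactly the inefficiency mentioned above.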

As for the likelihood, it really depends on what your model is. Regarding the formula you quote, I don't really understand what is going on there: what exactly are $Data$ and $Model$ supposed to be?
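
One hedged guess on my part (not something stated in the question): if the video assumed i.i.d. Gaussian measurement noise, then the likelihood is proportional to the exponential of a negative sum of squared residuals, and $e^{-(\text{Data}-\text{Model})}$ may be loose shorthand for that:

```python
import numpy as np

def log_likelihood_gaussian(data, model, sigma=1.0):
    # Under the assumed noise model data_i = model_i + N(0, sigma^2), i.i.d.,
    # L = prod_i N(data_i | model_i, sigma^2), so up to an additive constant
    # log L = -sum((data - model)^2) / (2 * sigma^2).
    resid = np.asarray(data) - np.asarray(model)
    return -0.5 * np.sum(resid**2) / sigma**2
```

Under that reading the exponent involves a *squared* difference and a noise scale $\sigma$; without the square (or an absolute value, which would correspond to Laplace noise), the formula as written is hard to interpret as an actual likelihood.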
