Solved – Metropolis-Hastings – Acceptance ratio, proposal and likelihood

markov-chain-montecarlo, mcmc, acceptance-rate, metropolis-hastings

From a previous post:

First, to explain the MH algorithm: it is used to numerically approximate a target distribution, in this case $p(\theta|D)$. At
each stage of the algorithm:

  1. A value $\theta_{proposed}$ is proposed using the jumping or proposal distribution.
  2. An acceptance ratio is calculated, equal to $\frac{p(\theta_{proposed}|D)}{p(\theta_{current}|D)}$. Because we
    cannot calculate $p(\theta|D)$ directly, we leverage the proportional
    expression of Bayes rule and calculate this quotient using the
    products of likelihood and prior corresponding to
    $\theta_{proposed}, \theta_{current}$. That is, the acceptance ratio
    is:

$$\frac{p(D|\theta_{proposed})p(\theta_{proposed})}{p(D|\theta_{current})p(\theta_{current})}$$

  3. If this ratio exceeds one (intuitively, if the proposed parameter value is more likely given the data and prior), we accept the proposal. If
    not, we accept it with probability equal to the ratio.

As more and more values are sampled in this way, the trace of $\theta$
values more and more closely approximates the true distribution of
$\theta$.
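In practice this ratio is usually evaluated on the log scale to avoid numerical underflow when the likelihood is very small. As a small illustration (the log-likelihood and log-prior below are toy stand-ins, not the model from this post):

    % Acceptance step on the log scale (toy log-likelihood and log-prior)
    logLik   = @(theta) -0.5*sum((theta - 1).^2);    % toy Gaussian log-likelihood
    logPrior = @(theta) -0.5*sum(theta.^2);          % toy Gaussian log-prior
    theta_curr = [0, 0];                             % current value
    theta_prop = [0.5, 0.2];                         % proposed value
    logA   = (logLik(theta_prop) + logPrior(theta_prop)) ...
           - (logLik(theta_curr) + logPrior(theta_curr));
    accept = log(rand) < logA;                       % accept with probability min(A,1)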

Question 1) In my case, I have 6 parameters to estimate. Are $p(\theta_{\text{proposed}})$ and
$p(\theta_{\text{current}})$ different, especially if I am using a uniform distribution?

Question 2) If they are not different, could I simply write:

$$\text{Acceptance}_{\text{Ratio}} = \dfrac{p(D|\theta_{\text{proposed}})}{p(D|\theta_{\text{current}})}$$

Question 3) If I use a normal proposal distribution, with its own mean and variance for each of the 6 parameters, do I still have symmetry of the proposal?

Question 4) How can I express this acceptance ratio as a function of the likelihood function, or of the cost function used to find the 6 parameters?

My cost function is:

$$J(\theta)=(y-H\,\theta)^{T}(y- H\,\theta)$$

with $d = H\,\theta$, where $d$ is the response of the system.
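For context, one standard way to relate such a quadratic cost to a likelihood is to assume i.i.d. Gaussian measurement noise with some variance $\sigma^2$ (an assumption, not something stated above); in that case

$$p(D|\theta) \propto \exp\!\left(-\frac{J(\theta)}{2\sigma^{2}}\right).$$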

UPDATE 1: Below is the Metropolis-Hastings code I am using:

    %%%%%% MCMC - Metropolis Algorithm %%%%%%%
    % Number of samples to draw
    nb = 1e5;
    % Metropolis loop counters
    % Index of main loop
    i = 1;
    % Reject counter
    j = 0;
    % Accept counter
    k = 0;
    % Starting values (mean and standard deviation for each parameter)
    l1 = [10, 3];
    l2 = [10, 3];
    l3 = [10, 3];
    l4 = [10, 3];
    l5 = [10, 3];
    l6 = [10, 3];

    % Initialise the parameter array
    paramsArray = zeros(nb,6);
    paramsArray(1,1:6) = [l1(1) + l1(2)*rand(1), l2(1) + l2(2)*rand(1), ...
                          l3(1) + l3(2)*rand(1), l4(1) + l4(2)*rand(1), ...
                          l5(1) + l5(2)*rand(1), l6(1) + l6(2)*rand(1)];
    % Initial parameter vector
    pStart = paramsArray(1,1:6)';
    % Standard deviation matrix for the proposal distribution
    gamMatrix = diag([l1(2), l2(2), l3(2), l4(2), l5(2), l6(2)]);
    % Starting value of the target: cost function times Gaussian prior
    w_x = Crit_J(pStart,D)*exp(-((pStart(1)-l1(1))^2/(2*l1(2)^2)+(pStart(2)-l2(1))^2/(2*l2(2)^2)+ ...
                                 (pStart(3)-l3(1))^2/(2*l3(2)^2)+(pStart(4)-l4(1))^2/(2*l4(2)^2)+ ...
                                 (pStart(5)-l5(1))^2/(2*l5(2)^2)+(pStart(6)-l6(1))^2/(2*l6(2)^2)));
    while (i <= nb)
        if (i == 1)
            % First iteration: use the starting value
            ptest = pStart;
        else
            % Propose a new value around pStart (absolute value keeps parameters positive)
            ptest = abs(pStart + gamMatrix*randn(6,1));
        end
        % Numerator of the acceptance ratio
        w_y = Crit_J(ptest,D)*exp(-((ptest(1)-l1(1))^2/(2*l1(2)^2)+(ptest(2)-l2(1))^2/(2*l2(2)^2)+ ...
                                    (ptest(3)-l3(1))^2/(2*l3(2)^2)+(ptest(4)-l4(1))^2/(2*l4(2)^2)+ ...
                                    (ptest(5)-l5(1))^2/(2*l5(2)^2)+(ptest(6)-l6(1))^2/(2*l6(2)^2)));
        % Generate u uniformly on (0,1)
        u = rand(1);
        % Log of the acceptance ratio
        log_prob = log(w_y/w_x);
        % Acceptance test
        if (log(u) < log_prob)
            % Accept: store the proposed value
            paramsArray(i,1:6) = ptest(1:6);
            % Recompute the denominator at the accepted value
            w_x = Crit_J(paramsArray(i,1:6),D)*exp(-((ptest(1)-l1(1))^2/(2*l1(2)^2)+(ptest(2)-l2(1))^2/(2*l2(2)^2)+ ...
                                                     (ptest(3)-l3(1))^2/(2*l3(2)^2)+(ptest(4)-l4(1))^2/(2*l4(2)^2)+ ...
                                                     (ptest(5)-l5(1))^2/(2*l5(2)^2)+(ptest(6)-l6(1))^2/(2*l6(2)^2)));
            i = i+1;
            k = k+1;
        else
            % Reject: keep the previous value
            if (i ~= 1)
                paramsArray(i,1:6) = paramsArray(i-1,1:6);
                i = i+1;
                j = j+1;
            end
        end
    end
    disp('acceptance ratio :');
    disp(k/nb)
    disp('rejection ratio :');
    disp(j/nb)
    % Display the mean of each parameter
    disp('Parameters with Metropolis-Hastings :');
    mean(paramsArray(:,1))
    mean(paramsArray(:,2))
    mean(paramsArray(:,3))
    mean(paramsArray(:,4))
    mean(paramsArray(:,5))
    mean(paramsArray(:,6))

and my cost function (used in place of a likelihood function):

    % Cost function
    function cost = Crit_J(p,D)

    % Compute the model corresponding to parameters p
    [R,C] = size(D);
    [Cols,Rows] = meshgrid(1:C,1:R);
    % Model
    Model = (1+((Rows-p(3)).^2+(Cols-p(4)).^2)/p(5)^2).^(-p(6));
    model = Model(:);
    d = D(:);
    % Introduce the H matrix
    H = [ model, ones(length(model),1)];
    % Compute the cost function: taking the absolute value
    cost = abs((d-H*[p(1),p(2)]')'*(d-H*[p(1),p(2)]'));

    end

If you can see the error, please point it out.

Best Answer

The problem with your description of the Metropolis-Hastings algorithm is that your notation does not distinguish between the probability densities in the actual problem you are trying to solve, and the proposal density used in the algorithm. Your notation also fails to capture the fact that we are trying to simulate from the posterior distribution, but we only have a kernel of this distribution. A better description of the algorithm, which makes these distinctions in the notation, is as follows:

You start in a situation where you do not know the posterior $p(\theta|D)$, but you do know a kernel of this distribution $K(\theta|D) \propto p(\theta|D)$. You want to simulate values from the posterior. In the MH algorithm you start at an arbitrary parameter value $\theta_0$ and simulate using the following recursive scheme (which is a Markov chain):

  1. We generate a proposed value $\theta'_{t+1}$ from the proposal density $g(\theta'_{t+1}|\theta_{t})$.
  2. For the proposed value, we define the acceptance ratio:

$$A(\theta'_{t+1} | \theta_t) \equiv \frac{K(\theta'_{t+1}|D)}{K(\theta_{t}|D)} \cdot \frac{g(\theta_{t} | \theta'_{t+1})}{g(\theta'_{t+1} | \theta_t)}.$$

  3. With probability $\min (A(\theta'_{t+1} | \theta_t), 1)$ we accept the proposed value and set $\theta_{t+1} = \theta'_{t+1}$. Otherwise we reject the proposed value and set $\theta_{t+1} = \theta_{t}$.

It can be shown that this Markov chain has stationary distribution $p(\theta|D)$, which is the posterior distribution of interest. Note that this is true even though the algorithm only uses a kernel of the distribution. We can therefore rely on the limiting properties of Markov chains to simulate from this posterior distribution. Usually this involves generating a number of 'burn-in' iterations followed by a series of auto-correlated simulations from the limiting stationary distribution. We can also rely on ergodic theorems to estimate the true posterior moments of functions of the parameter by the corresponding sample moments of the Markov chain.

Special case - symmetric proposal distribution: In many applications of the MH algorithm it is common to use a proposal density that is symmetric, in the sense that:

$$g(\theta'|\theta) = g(\theta|\theta') \quad \text{for all } \theta, \theta'.$$

(Note that a sufficient condition for this is that the density value depends on the parameters only through the norm $||\theta' - \theta||$.) In this special case the acceptance ratio reduces to:

$$A(\theta'_{t+1} | \theta_t) \equiv \frac{K(\theta'_{t+1}|D)}{K(\theta_{t}|D)}.$$
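To make the recursion concrete, here is a minimal random-walk Metropolis sketch in MATLAB for this symmetric special case. It is only an illustration: the target kernel logK is a toy two-dimensional Gaussian and the step size sigma is an arbitrary tuning choice; in your problem logK would be the log of your own kernel $K(\theta|D)$.

    % Minimal random-walk Metropolis sketch (symmetric Gaussian proposal)
    logK  = @(theta) -0.5*sum(theta.^2);   % log K(theta|D) up to a constant (toy target)
    nIter = 1e4;                           % number of iterations
    sigma = 0.5;                           % proposal standard deviation (tuning choice)
    theta = zeros(nIter,2);                % storage for the chain
    theta(1,:) = [5, -5];                  % arbitrary starting value theta_0
    nAccept = 0;
    for t = 1:nIter-1
        % Step 1: propose theta' from a Gaussian centred at the current state
        prop = theta(t,:) + sigma*randn(1,2);
        % Step 2: log acceptance ratio (the proposal densities cancel by symmetry)
        logA = logK(prop) - logK(theta(t,:));
        % Step 3: accept with probability min(A,1), otherwise keep the current state
        if log(rand) < logA
            theta(t+1,:) = prop;
            nAccept = nAccept + 1;
        else
            theta(t+1,:) = theta(t,:);
        end
    end
    disp(nAccept/(nIter-1))                % acceptance rate

Note that the comparison is done on the log scale, as in your own code, which avoids underflow when the kernel values are very small.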

Now that we have a clearer explanation of the actual workings of the algorithm, I will try to answer your specific questions. (For consistency, I will translate your questions into notation that is consistent with my explanation of the algorithm.) Your Question 4 is unclear to me (there is no cost function in the algorithm so I don't know what you're referring to here), but I will answer the other three questions.


Question 1) Are $g(\theta'_{t+1} | \theta_t)$ and $g(\theta_t | \theta'_{t+1})$ different? What if I am using a uniform proposal distribution?

In the case where you use a proposal distribution that is symmetric (in the sense described above) the two proposal densities (with the argument and conditioning parameter switched) will be the same. Symmetry occurs in the case where you use the uniform proposal density that is centred around the conditioning value:

$$g(\theta' | \theta) \propto \mathbb{I}(||\theta' - \theta|| \leqslant \delta).$$

In this case, switching the terms in the proposal density does not alter the value (i.e., they are not different). If you are using a uniform proposal density that is not centred around the conditioning value then this will not hold.
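As a quick numerical illustration (the half-width delta and the parameter values below are arbitrary), a uniform density centred at the conditioning value returns the same number when the argument and the conditioning value are switched:

    % Symmetry check for a uniform proposal centred at the conditioning value
    delta = 0.5;                                     % half-width of the proposal (arbitrary)
    g = @(tp, tc) (abs(tp - tc) <= delta)/(2*delta); % uniform density centred at tc
    theta_c = 1.3;  theta_p = 1.6;                   % arbitrary current and proposed values
    [g(theta_p, theta_c), g(theta_c, theta_p)]       % equal values: the proposal is symmetric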

Question 2) If I am using a uniform proposal distribution, could I write the acceptance ratio without the ratio of proposal densities?

Assuming your uniform proposal distribution is centred around the conditioning parameter (and thus symmetric in the above sense), yes you can.

Question 3) If I use a normal distribution, will I still have symmetry of the proposal distribution?

Symmetry occurs if you use a normal distribution with mean equal to the conditioning parameter and variance independent of this parameter:

$$g(\theta' | \theta) = \text{N}(\theta' |\theta, \Sigma) \propto \exp \Big( -\frac{1}{2} (\theta' - \theta)^\text{T} \Sigma^{-1} (\theta' - \theta) \Big).$$
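Concretely, with a diagonal covariance like the gamMatrix in your code (the standard deviations below are placeholders), swapping the argument and the conditioning value leaves the density unchanged, so the proposal terms cancel in the acceptance ratio:

    % Symmetry check for a Gaussian random-walk proposal in 6 dimensions
    sig  = [3 3 3 3 3 3];                            % proposal standard deviations (placeholders)
    logg = @(tp, tc) -0.5*sum(((tp - tc)./sig).^2);  % log N(tp | tc, diag(sig.^2)), up to a constant
    tc = [10 10 10 10 10 10];                        % conditioning (current) value
    tp = tc + sig.*randn(1,6);                       % proposed value
    [logg(tp, tc), logg(tc, tp)]                     % equal: g(theta'|theta) = g(theta|theta')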