Solved – Metropolis-Hastings – Acceptance ratio, proposal and likelihood

markov-chain-montecarlo, mcmc, acceptance-rate, metropolis-hastings

From a previous post:

First, to explain the MH algorithm: it is used to numerically approximate a target distribution, in this case $p(\theta|D)$. At
each stage of the algorithm:

  1. A value $\theta_{proposed}$ is proposed using the jumping or proposal distribution.
  2. An acceptance ratio is calculated, equal to $\frac{p(\theta_{proposed}|D)}{p(\theta_{current}|D)}$. Because we
    cannot calculate $p(\theta|D)$ directly, we leverage the proportional
    expression of Bayes rule and calculate this quotient using the
    products of likelihood and prior corresponding to
    $\theta_{proposed}, \theta_{current}$. That is, the acceptance ratio
    is:

$$\frac{p(D|\theta_{proposed})p(\theta_{proposed})}{p(D|\theta_{current})p(\theta_{current})}$$

  3. If this ratio exceeds one (intuitively, if the proposed parameter value is more likely given the data and prior), we accept the proposal. If
    not, we accept it with probability equal to the ratio.

As more and more values are sampled in this way, the trace of $\theta$
values more and more closely approximates the true distribution of
$\theta$.
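In practice this ratio is usually evaluated on the log scale to avoid numerical underflow when the likelihood is very small. As a small illustration (the log-likelihood and log-prior below are toy stand-ins, not the model from this post):

    % Acceptance step on the log scale (toy log-likelihood and log-prior)
    logLik   = @(theta) -0.5*sum((theta - 1).^2);    % toy Gaussian log-likelihood
    logPrior = @(theta) -0.5*sum(theta.^2);          % toy Gaussian log-prior
    theta_curr = [0, 0];                             % current value
    theta_prop = [0.5, 0.2];                         % proposed value
    logA   = (logLik(theta_prop) + logPrior(theta_prop)) ...
           - (logLik(theta_curr) + logPrior(theta_curr));
    accept = log(rand) < logA;                       % accept with probability min(A,1)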

Question 1) In my case, I have 6 parameters to estimate. Are $p(\theta_{\text{proposed}})$ and
$p(\theta_{\text{current}})$ different, especially if I am using a uniform distribution?

Question 2) If they are not different, could I simply write:

$$\text{Acceptance}_{\text{Ratio}} = \dfrac{p(D|\theta_{\text{proposed}})}{p(D|\theta_{\text{current}})}$$

Question 3) If I use a normal proposal distribution, with its own mean and variance for each of the 6 parameters, do I still have symmetry of the proposal?

Question 4) How can I express this acceptance ratio as a function of the likelihood function, or of the cost function used to find the 6 parameters?

My cost function is:

$$J(\theta)=(y-H\,\theta)^{T}(y- H\,\theta)$$

with $d = H\,\theta$, where $d$ is the response of the system.
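For context, one standard way to relate such a quadratic cost to a likelihood is to assume i.i.d. Gaussian measurement noise with some variance $\sigma^2$ (an assumption, not something stated above); in that case

$$p(D|\theta) \propto \exp\!\left(-\frac{J(\theta)}{2\sigma^{2}}\right).$$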

UPDATE 1: Below is the Metropolis-Hastings code I am using:

    %%%%%% MCMC - Metropolis Algorithm %%%%%%%
    % Number of samples to draw
    nb = 1e5;
    % Metropolis loop counters
    % Index of main loop
    i = 1;
    % Reject counter
    j = 0;
    % Accept counter
    k = 0;
    % Starting values (mean and standard deviation for each parameter)
    l1 = [10, 3];
    l2 = [10, 3];
    l3 = [10, 3];
    l4 = [10, 3];
    l5 = [10, 3];
    l6 = [10, 3];

    % Initialise the parameter array
    paramsArray = zeros(nb,6);
    paramsArray(1,1:6) = [l1(1) + l1(2)*rand(1), l2(1) + l2(2)*rand(1), ...
                          l3(1) + l3(2)*rand(1), l4(1) + l4(2)*rand(1), ...
                          l5(1) + l5(2)*rand(1), l6(1) + l6(2)*rand(1)];
    % Initial parameter vector
    pStart = paramsArray(1,1:6)';
    % Standard deviation matrix for the proposal distribution
    gamMatrix = diag([l1(2), l2(2), l3(2), l4(2), l5(2), l6(2)]);
    % Starting value of the target: cost function times Gaussian prior
    w_x = Crit_J(pStart,D)*exp(-((pStart(1)-l1(1))^2/(2*l1(2)^2)+(pStart(2)-l2(1))^2/(2*l2(2)^2)+ ...
                                 (pStart(3)-l3(1))^2/(2*l3(2)^2)+(pStart(4)-l4(1))^2/(2*l4(2)^2)+ ...
                                 (pStart(5)-l5(1))^2/(2*l5(2)^2)+(pStart(6)-l6(1))^2/(2*l6(2)^2)));
    while (i <= nb)
        if (i == 1)
            % First iteration: use the starting value
            ptest = pStart;
        else
            % Propose a new value around pStart (absolute value keeps parameters positive)
            ptest = abs(pStart + gamMatrix*randn(6,1));
        end
        % Numerator of the acceptance ratio
        w_y = Crit_J(ptest,D)*exp(-((ptest(1)-l1(1))^2/(2*l1(2)^2)+(ptest(2)-l2(1))^2/(2*l2(2)^2)+ ...
                                    (ptest(3)-l3(1))^2/(2*l3(2)^2)+(ptest(4)-l4(1))^2/(2*l4(2)^2)+ ...
                                    (ptest(5)-l5(1))^2/(2*l5(2)^2)+(ptest(6)-l6(1))^2/(2*l6(2)^2)));
        % Generate u uniformly on (0,1)
        u = rand(1);
        % Log of the acceptance ratio
        log_prob = log(w_y/w_x);
        % Acceptance test
        if (log(u) < log_prob)
            % Accept: store the proposed value
            paramsArray(i,1:6) = ptest(1:6);
            % Recompute the denominator at the accepted value
            w_x = Crit_J(paramsArray(i,1:6),D)*exp(-((ptest(1)-l1(1))^2/(2*l1(2)^2)+(ptest(2)-l2(1))^2/(2*l2(2)^2)+ ...
                                                     (ptest(3)-l3(1))^2/(2*l3(2)^2)+(ptest(4)-l4(1))^2/(2*l4(2)^2)+ ...
                                                     (ptest(5)-l5(1))^2/(2*l5(2)^2)+(ptest(6)-l6(1))^2/(2*l6(2)^2)));
            i = i+1;
            k = k+1;
        else
            % Reject: keep the previous value
            if (i ~= 1)
                paramsArray(i,1:6) = paramsArray(i-1,1:6);
                i = i+1;
                j = j+1;
            end
        end
    end
    disp('acceptance ratio :');
    disp(k/nb)
    disp('rejection ratio :');
    disp(j/nb)
    % Display the mean of each parameter
    disp('Parameters with Metropolis-Hastings :');
    mean(paramsArray(:,1))
    mean(paramsArray(:,2))
    mean(paramsArray(:,3))
    mean(paramsArray(:,4))
    mean(paramsArray(:,5))
    mean(paramsArray(:,6))

and my cost function (used in place of a likelihood function):

    % Cost function
    function cost = Crit_J(p,D)

    % Compute the model corresponding to parameters p
    [R,C] = size(D);
    [Cols,Rows] = meshgrid(1:C,1:R);
    % Model
    Model = (1+((Rows-p(3)).^2+(Cols-p(4)).^2)/p(5)^2).^(-p(6));
    model = Model(:);
    d = D(:);
    % Introduce the H matrix
    H = [ model, ones(length(model),1)];
    % Compute the cost function: taking the absolute value
    cost = abs((d-H*[p(1),p(2)]')'*(d-H*[p(1),p(2)]'));

    end

If you can see the error, please point it out.

Best Answer

The problem with your description of the Metropolis-Hastings algorithm is that your notation does not distinguish between the probability densities in the actual problem you are trying to solve, and the proposal density used in the algorithm. Your notation also fails to capture the fact that we are trying to simulate from the posterior distribution, but we only have a kernel of this distribution. A better description of the algorithm, which makes these distinctions in the notation, is as follows:

You start in a situation where you do not know the posterior $p(\theta|D)$, but you do know a kernel of this distribution $K(\theta|D) \propto p(\theta|D)$. You want to simulate values from the posterior. In the MH algorithm you start at an arbitrary parameter value $\theta_0$ and simulate using the following recursive scheme (which is a Markov chain):

  1. We generate a proposed value $\theta'_{t+1}$ from the proposal density $g(\theta'_{t+1}|\theta_{t})$.
  2. For the proposed value, we define the acceptance ratio:

$$A(\theta'_{t+1} | \theta_t) \equiv \frac{K(\theta'_{t+1}|D)}{K(\theta_{t}|D)} \cdot \frac{g(\theta_{t} | \theta'_{t+1})}{g(\theta'_{t+1} | \theta_t)}.$$

  3. With probability $\min (A(\theta'_{t+1} | \theta_t), 1)$ we accept the proposed value and set $\theta_{t+1} = \theta'_{t+1}$. Otherwise we reject the proposed value and set $\theta_{t+1} = \theta_{t}$.

It can be shown that this Markov chain has stationary distribution $p(\theta|D)$, which is the posterior distribution of interest. Note that this is true even though the algorithm only uses a kernel of the distribution. We can therefore rely on the limiting properties of Markov chains to simulate from this posterior distribution. Usually this involves generating a number of 'burn-in' iterations followed by a series of auto-correlated simulations from the limiting stationary distribution. We can also rely on ergodic theorems to estimate the true posterior moments of functions of the parameter by the corresponding sample moments of the Markov chain.

Special case - symmetric proposal distribution: In many applications of the MH algorithm it is common to use a proposal density that is symmetric, in the sense that:

$$g(\theta'|\theta) = g(\theta|\theta') \quad \text{for all } \theta, \theta'.$$

(Note that a sufficient condition for this is that the density value depends on the parameters only through the norm $||\theta' - \theta||$.) In this special case the acceptance ratio reduces to:

$$A(\theta'_{t+1} | \theta_t) \equiv \frac{K(\theta'_{t+1}|D)}{K(\theta_{t}|D)}.$$
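To make the recursion concrete, here is a minimal random-walk Metropolis sketch in MATLAB for this symmetric special case. It is only an illustration: the target kernel logK is a toy two-dimensional Gaussian and the step size sigma is an arbitrary tuning choice; in your problem logK would be the log of your own kernel $K(\theta|D)$.

    % Minimal random-walk Metropolis sketch (symmetric Gaussian proposal)
    logK  = @(theta) -0.5*sum(theta.^2);   % log K(theta|D) up to a constant (toy target)
    nIter = 1e4;                           % number of iterations
    sigma = 0.5;                           % proposal standard deviation (tuning choice)
    theta = zeros(nIter,2);                % storage for the chain
    theta(1,:) = [5, -5];                  % arbitrary starting value theta_0
    nAccept = 0;
    for t = 1:nIter-1
        % Step 1: propose theta' from a Gaussian centred at the current state
        prop = theta(t,:) + sigma*randn(1,2);
        % Step 2: log acceptance ratio (the proposal densities cancel by symmetry)
        logA = logK(prop) - logK(theta(t,:));
        % Step 3: accept with probability min(A,1), otherwise keep the current state
        if log(rand) < logA
            theta(t+1,:) = prop;
            nAccept = nAccept + 1;
        else
            theta(t+1,:) = theta(t,:);
        end
    end
    disp(nAccept/(nIter-1))                % acceptance rate

Note that the comparison is done on the log scale, as in your own code, which avoids underflow when the kernel values are very small.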

Now that we have a clearer explanation of the actual workings of the algorithm, I will try to answer your specific questions. (For consistency, I will translate your questions into notation that is consistent with my explanation of the algorithm.) Your Question 4 is unclear to me (there is no cost function in the algorithm so I don't know what you're referring to here), but I will answer the other three questions.


Question 1) Are $g(\theta'_{t+1} | \theta_t)$ and $g(\theta_t | \theta'_{t+1})$ different? What if I am using a uniform proposal distribution?

In the case where you use a proposal distribution that is symmetric (in the sense described above) the two proposal densities (with the argument and conditioning parameter switched) will be the same. Symmetry occurs in the case where you use the uniform proposal density that is centred around the conditioning value:

$$g(\theta' | \theta) \propto \mathbb{I}(||\theta' - \theta|| \leqslant \delta).$$

In this case, switching the terms in the proposal density does not alter the value (i.e., they are not different). If you are using a uniform proposal density that is not centred around the conditioning value then this will not hold.
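As a quick numerical illustration (the half-width delta and the parameter values below are arbitrary), a uniform density centred at the conditioning value returns the same number when the argument and the conditioning value are switched:

    % Symmetry check for a uniform proposal centred at the conditioning value
    delta = 0.5;                                     % half-width of the proposal (arbitrary)
    g = @(tp, tc) (abs(tp - tc) <= delta)/(2*delta); % uniform density centred at tc
    theta_c = 1.3;  theta_p = 1.6;                   % arbitrary current and proposed values
    [g(theta_p, theta_c), g(theta_c, theta_p)]       % equal values: the proposal is symmetric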

Question 2) If I am using a uniform proposal distribution, could I write the acceptance ratio without the ratio of proposal densities?

Assuming your uniform proposal distribution is centred around the conditioning parameter (and thus symmetric in the above sense), yes you can.

Question 3) If I use a normal distribution, will I still have symmetry of the proposal distribution?

Symmetry occurs if you use a normal distribution with mean equal to the conditioning parameter and variance independent of this parameter:

$$g(\theta' | \theta) = \text{N}(\theta' |\theta, \Sigma) \propto \exp \Big( -\frac{1}{2} (\theta' - \theta)^\text{T} \Sigma^{-1} (\theta' - \theta) \Big).$$
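Concretely, with a diagonal covariance like the gamMatrix in your code (the standard deviations below are placeholders), swapping the argument and the conditioning value leaves the density unchanged, so the proposal terms cancel in the acceptance ratio:

    % Symmetry check for a Gaussian random-walk proposal in 6 dimensions
    sig  = [3 3 3 3 3 3];                            % proposal standard deviations (placeholders)
    logg = @(tp, tc) -0.5*sum(((tp - tc)./sig).^2);  % log N(tp | tc, diag(sig.^2)), up to a constant
    tc = [10 10 10 10 10 10];                        % conditioning (current) value
    tp = tc + sig.*randn(1,6);                       % proposed value
    [logg(tp, tc), logg(tc, tp)]                     % equal: g(theta'|theta) = g(theta|theta')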