Bayesian – How to Perform Transformation of Variables in Metropolis Hastings Algorithm

Tags: bayesian, jacobian, markov-chain-montecarlo, metropolis-hastings

Say I have a bunch of data from a Poisson distribution and I want to find out my posterior i.e. I'm data fitting:

$p(\lambda | X) \sim p(X|\lambda)p(\lambda)$

where $p(X|\lambda) = \frac{\exp(-\lambda)\lambda^x}{x!}$ so that my log-likelihood looks like:

$\log \mathcal{L}(\lambda|X) \sim x \log\lambda - \lambda$

Now as $\lambda > 0$, I transform my coordinates to be $\alpha = \log \lambda$. My new distribution looks like:

$p(X|\alpha) = \frac{\exp(-\exp(\alpha))\exp(\alpha x)}{x!}\cdot \bf{\exp(\alpha)}$

where the final $\exp(\alpha)$ comes from the Jacobian of the transformation.

This makes:

$\log \mathcal{L}(\lambda|X) \sim -\exp(\alpha) + \alpha x + \bf{\alpha}$

where the final $\bf{\alpha}$ in the new log-likelihood is from the earlier Jacobian.

The problem I'm having is that if I include that new $\alpha$ then my Metropolis-Hastings MCMC gives me a result that is incorrect. If I use a log-likelihood that excludes it:

$\log \mathcal{L}(\lambda|X) \sim -\exp(\alpha) + \alpha x$

then I get correct results.

My question is:
Why does the Metropolis-Hastings algorithm not care about the Jacobian?

Best Answer

You do not need the $\alpha$ since it is a parameter. The change of variables formula applies to the variable with respect to which you are "integrating". It is $x$ in your case. So MH is right to demand that you remove the excess factor.

So what you really have is:

$$ p(X|\alpha) = \frac{\exp(-\exp(\alpha))\exp(\alpha x)}{x!} $$

Had you applied some transformation to your $x$ variable, then the change of variables formula should be used.

EDIT: To understand what's going on, think of a normal RV $X \sim \mathcal{N}(\mu ,\sigma^2)$, so that $p(X|\mu,\sigma^2)$ is the density. If you transform $\mu$ with any transformation $f$, the new variable is $Y \sim \mathcal{N}(f(\mu) ,\sigma^2)$ and no Jacobian is necessary. I hope you agree (if not, I'll have to write more in tex...).

If you want $\mathcal{P}(X\in A|\alpha)$ you'd integrate $x$ and keep $\alpha$ fixed - that's what I mean when I say "integrate". Probability is all about integration, after all.

So in the end you have $p(x|\alpha)$ with no extra Jacobian term. Then proceed as usual with Bayes' rule etc. and you'll get the "right" density.
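To make the "proceed as usual with Bayes' rule" step concrete, here is a minimal R sketch of a random-walk Metropolis sampler run on $\alpha = \log\lambda$. The simulated data and the Gamma(a, b) prior on $\lambda$ are assumptions made for this example, not part of the question. The likelihood is only re-expressed in $\alpha$ and gets no Jacobian; a prior written as a density on $\lambda$ is what picks up the factor $\exp(\alpha)$ when rewritten as a density on $\alpha$. Because the Gamma prior is conjugate, the result can be checked against the exact posterior mean.

```r
# Sketch: random-walk Metropolis on alpha = log(lambda) for Poisson data.
# Assumptions (not from the original post): simulated data, Gamma(a, b) prior on lambda.
set.seed(1)
x <- rpois(50, lambda = 4)   # simulated data
a <- 2; b <- 1               # hypothetical prior shape and rate

# Log posterior in alpha. The likelihood is re-expressed in alpha (no Jacobian);
# the prior density on lambda picks up the Jacobian term "+ alpha".
log_post_alpha <- function(alpha) {
  lambda <- exp(alpha)
  sum(dpois(x, lambda, log = TRUE)) +
    dgamma(lambda, shape = a, rate = b, log = TRUE) + alpha
}

n_iter <- 20000
alpha <- numeric(n_iter)
alpha[1] <- log(mean(x) + 0.5)
for (t in 2:n_iter) {
  prop  <- rnorm(1, alpha[t - 1], 0.2)   # symmetric proposal, no Hastings term
  log_r <- log_post_alpha(prop) - log_post_alpha(alpha[t - 1])
  alpha[t] <- if (log(runif(1)) < log_r) prop else alpha[t - 1]
}

lambda_draws    <- exp(alpha[-(1:2000)])            # drop burn-in, map back to lambda
post_mean_mcmc  <- mean(lambda_draws)
post_mean_exact <- (a + sum(x)) / (b + length(x))   # conjugate Gamma posterior mean
```

Comparing `post_mean_mcmc` with `post_mean_exact` gives a quick sanity check that the sampler on $\alpha$ targets the same posterior as the one on $\lambda$.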

Related Solutions

Metropolis-Hastings – Understanding Asymmetric Proposal Distribution

The bibliography states that if q is a symmetric distribution the ratio q(x|y)/q(y|x) becomes 1 and the algorithm is called Metropolis. Is that correct?

Yes, this is correct. The Metropolis algorithm is a special case of the MH algorithm.

What about "Random Walk" Metropolis(-Hastings)? How does it differ from the other two?

In a random walk, the proposal distribution is re-centered after each step at the value last generated by the chain. Generally, in a random walk the proposal distribution is Gaussian, in which case this random walk satisfies the symmetry requirement and the algorithm is Metropolis. I suppose you could perform a "pseudo" random walk with an asymmetric distribution, which would cause the proposals to drift in the opposite direction of the skew (a left-skewed distribution would favor proposals toward the right). I'm not sure why you would do this, but you could, and it would be a Metropolis-Hastings algorithm (i.e. it would require the additional ratio term).
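As a rough illustration of such a skewed random walk, the R sketch below re-centers a log-normal proposal at the current value and adds the extra ratio term to the acceptance calculation. The Gamma(3, 2) target and the proposal scale `s` are assumptions chosen only to keep the example small.

```r
# Sketch: Metropolis-Hastings with a skewed (log-normal) random-walk proposal.
# Assumptions: the Gamma(3, 2) target and the proposal scale are illustrative only.
set.seed(42)
log_f <- function(x) dgamma(x, shape = 3, rate = 2, log = TRUE)  # hypothetical target
s <- 0.8                                                         # proposal scale on the log axis

n_iter <- 20000
x <- numeric(n_iter)
x[1] <- 1
for (t in 2:n_iter) {
  cur  <- x[t - 1]
  prop <- rlnorm(1, meanlog = log(cur), sdlog = s)   # re-centered at the current value
  # Hastings correction: log q(cur | prop) - log q(prop | cur)
  log_q_ratio <- dlnorm(cur,  meanlog = log(prop), sdlog = s, log = TRUE) -
                 dlnorm(prop, meanlog = log(cur),  sdlog = s, log = TRUE)
  log_r <- log_f(prop) - log_f(cur) + log_q_ratio
  x[t] <- if (log(runif(1)) < log_r) prop else cur
}
mh_mean <- mean(x[-(1:2000)])   # the target mean is shape / rate = 1.5
```

Dropping `log_q_ratio` from `log_r` would turn this into plain Metropolis, which is only valid when the proposal is symmetric.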

How does it differ from the other two?

In a non-random walk algorithm, the proposal distributions are fixed. In the random walk variant, the center of the proposal distribution changes at each iteration.

What if the proposal distribution is a Poisson distribution?

Then you need to use MH instead of just metropolis. Presumably this would be to sample a discrete distribution, otherwise you wouldn't want to use a discrete function to generate your proposals.

In any event, if the sampling distribution is truncated or you have prior knowledge of its skew, you probably want to use an asymmetric sampling distribution and therefore need to use metropolis-hastings.

Could someone give me a simple code (C, python, R, pseudo-code or whatever you prefer) example?

Here's metropolis:

Metropolis <- function(F_sample # distribution we want to sample
                      , F_prop  # proposal distribution 
                      , I=1e5   # iterations
               ){
  y = rep(NA, I)   # pre-allocate one slot per iteration
  y[1] = 0    # starting location for random walk
  accepted = c(1)

  for(t in 2:I)    {
    #y.prop <- rnorm(1, y[t-1], sqrt(sigma2) ) # random walk proposal
    y.prop <- F_prop(y[t-1]) # implementation assumes a random walk. 
                             # discard this input for a fixed proposal distribution

    # We work with the log-likelihoods for numeric stability.
    logR = sum(log(F_sample(y.prop))) -
           sum(log(F_sample(y[t-1])))    

    R = exp(logR)

    u <- runif(1)        ## uniform variable to determine acceptance
    if(u < R){           ## accept the new value
      y[t] = y.prop
      accepted = c(accepted,1)
    }    
    else{
      y[t] = y[t-1]      ## reject the new value
      accepted = c(accepted,0)
    }    
  }
  return(list(y, accepted))
}

Let's try using this to sample a bimodal distribution. First, let's see what happens if we use a random walk for our proposal:

set.seed(100)

test = function(x){dnorm(x,-5,1)+dnorm(x,7,3)}

# random walk
response1 <- Metropolis(F_sample = test
                       ,F_prop = function(x){rnorm(1, x, sqrt(0.5) )}
                      ,I=1e5
                       )
y_trace1 = response1[[1]]; accpt_1 = response1[[2]]
mean(accpt_1) # acceptance rate without considering burn-in
# 0.85585   not bad

# looks about how we'd expect
plot(density(y_trace1))
abline(v=-5);abline(v=7) # Highlight the approximate modes of the true distribution


Now let's try sampling using a fixed proposal distribution and see what happens:

response2 <- Metropolis(F_sample = test
                            ,F_prop = function(x){rnorm(1, -5, sqrt(0.5) )}
                            ,I=1e5
                       )

y_trace2 = response2[[1]]; accpt_2 = response2[[2]]
mean(accpt_2) # .871, not bad

This looks ok at first, but if we take a look at the posterior density...

plot(density(y_trace2))


we'll see that it's completely stuck at a local maximum. This isn't entirely surprising since we actually centered our proposal distribution there. Same thing happens if we center this on the other mode:

response2b <- Metropolis(F_sample = test
                        ,F_prop = function(x){rnorm(1, 7, sqrt(10) )}
                        ,I=1e5
)

plot(density(response2b[[1]]))

We can try dropping our proposal between the two modes, but we'll need to set the variance really high to have a chance at exploring either of them:

response3 <- Metropolis(F_sample = test
                        ,F_prop = function(x){rnorm(1, -2, sqrt(10) )}
                        ,I=1e5
)
y_trace3 = response3[[1]]; accpt_3 = response3[[2]]
mean(accpt_3) # .3958!

Notice how the choice of the center of our proposal distribution has a significant impact on the acceptance rate of our sampler.

plot(density(y_trace3))


plot(y_trace3) # we really need to set the variance pretty high to catch 
               # the mode at +7. We're still just barely exploring it

We still get stuck in the closer of the two modes. Let's try dropping this directly between the two modes.

response4 <- Metropolis(F_sample = test
                        ,F_prop = function(x){rnorm(1, 1, sqrt(10) )}
                        ,I=1e5
)
y_trace4 = response4[[1]]; accpt_4 = response4[[2]]

plot(density(y_trace1))
lines(density(y_trace4), col='red')


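Finally, we're getting closer to what we were looking for. Note, though, that a fixed proposal is not symmetric in the Metropolis sense: its density q does not depend on the current state, so the acceptance ratio should also include q(y[t-1])/q(y.prop). The Metropolis function above omits that term, so the fixed-proposal runs actually target a density proportional to the target times the proposal, and running them longer will not remove that bias. The random walk, whose Gaussian proposal is symmetric, produced a usable sample very quickly, whereas we had to take advantage of our knowledge of how the posterior was supposed to look to tune the fixed sampling distributions to produce a usable result (which, truth be told, we don't quite have yet in y_trace4).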

I'll try to update with an example of metropolis hastings later. You should be able to see fairly easily how to modify the above code to produce a metropolis hastings algorithm (hint: you just need to add the supplemental ratio into the logR calculation).
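Following that hint, here is one way the modification might look. This is a sketch, not the author's promised update: the function name `MetropolisHastings` and the arguments `r_prop` (draws a proposal) and `d_prop` (evaluates the proposal density) are names invented for this example. It is run with a fixed proposal centered between the two modes, where the supplemental ratio is not 1.

```r
# Sketch: Metropolis-Hastings with an explicit proposal-density correction.
# r_prop(x) draws a proposal given the current state x;
# d_prop(to, from) returns the proposal density q(to | from).
# These argument names are assumptions made for this example.
MetropolisHastings <- function(F_sample, r_prop, d_prop, I = 1e4, start = 0) {
  y <- numeric(I)
  y[1] <- start
  accepted <- logical(I)
  accepted[1] <- TRUE
  for (t in 2:I) {
    cur  <- y[t - 1]
    prop <- r_prop(cur)
    logR <- log(F_sample(prop)) - log(F_sample(cur)) +
            log(d_prop(cur, prop)) - log(d_prop(prop, cur))  # supplemental ratio
    if (log(runif(1)) < logR) {
      y[t] <- prop
      accepted[t] <- TRUE
    } else {
      y[t] <- cur
    }
  }
  list(y = y, accepted = accepted)
}

set.seed(100)
test <- function(x) dnorm(x, -5, 1) + dnorm(x, 7, 3)

# Fixed (independent) proposal centered between the modes.
fixed <- MetropolisHastings(
  F_sample = test,
  r_prop   = function(x) rnorm(1, 1, 6),
  d_prop   = function(to, from) dnorm(to, 1, 6),
  I        = 2e4
)
mean(fixed$accepted)   # acceptance rate
mean(fixed$y)          # the equal-weight mixture has mean (-5 + 7) / 2 = 1
```

With a symmetric random-walk proposal the two `d_prop` terms cancel and the function reduces to the Metropolis sampler above.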

Solved – MCMC Metropolis-Hastings’ jumping distribution for non-negative parameters

Let $q(x^*|x_{old})$ be your proposal (jumping) distribution and let $x^*$ be a proposed value from this distribution. Then the Metropolis-Hastings acceptance probability is $$ \rho = \min \left\{ 1, \frac{f(x^*)}{f(x_{old})} \frac{q(x_{old}|x^*)}{q(x^*|x_{old})} \right\} $$ and you set $x_{new} = x^*$ with probability $\rho$ and $x_{new} = x_{old}$ otherwise.

The Metropolis-Hastings algorithm is appealing because you have a wide range of flexibility in specifying $q$. For example, you can use $N(x_{old},\sigma^2)$ in which case you have to remember that the target $f$ for a non-negative parameter includes an indicator function that indicates the parameter must be positive. Thus any negative values for $x^*$ are rejected because $f(x^*)$ will be zero and you set $x_{new} = x_{old}$. This proposal is convenient because it is symmetric, i.e. $q(x_{old}|x^*)= q(x^*|x_{old})$ for all $x^*, x_{old}$. Thus, you never need to calculate the last ratio in the acceptance probability.
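A minimal R sketch of that symmetric random walk follows. The Gamma(3, 1) target is an assumption standing in for the real posterior; the indicator is built into `log_f`, which returns `-Inf` for non-positive values so those proposals are always rejected.

```r
# Sketch: symmetric normal random walk for a non-negative parameter.
# Assumption: a Gamma(3, 1) target stands in for the real posterior.
set.seed(7)
log_f <- function(x) {
  if (x > 0) dgamma(x, shape = 3, rate = 1, log = TRUE) else -Inf  # indicator built in
}

sigma  <- 1.5
n_iter <- 20000
x <- numeric(n_iter)
x[1] <- 1
n_negative <- 0
for (t in 2:n_iter) {
  prop <- rnorm(1, x[t - 1], sigma)
  if (prop <= 0) n_negative <- n_negative + 1
  log_rho <- min(0, log_f(prop) - log_f(x[t - 1]))  # proposal ratio is 1, so it is omitted
  x[t] <- if (log(runif(1)) < log_rho) prop else x[t - 1]
}
min(x) > 0    # negative proposals are never accepted
n_negative    # how many proposals were wasted on negative values
mean(x)       # Gamma(3, 1) has mean 3
```

The count in `n_negative` shows the cost of this convenience: each negative proposal is a wasted iteration, which matters more when the target has substantial mass near zero.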

Alternatively (as suggested in the comments), you can use a truncated normal, e.g. $N(x_{old},\sigma^2)\mathrm{I}(x^*>0).$ But note that the normalizing constant for this truncated normal depends on $x_{old}$ and thus this proposal is not a symmetric proposal and thus you will need to calculate the last ratio in the acceptance probability.
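For the truncated-normal proposal, the normalizing constant is $\Phi(x_{old}/\sigma)$, so the last ratio reduces to $\Phi(x_{old}/\sigma)/\Phi(x^*/\sigma)$. The R sketch below computes it from the full proposal densities rather than the simplified form. The Gamma(3, 1) target and the helper names `r_tnorm` and `log_d_tnorm` are assumptions made for this example.

```r
# Sketch: truncated-normal proposal on (0, Inf) with the Hastings correction.
# Assumption: a Gamma(3, 1) target stands in for the real posterior.
set.seed(11)
log_f <- function(x) dgamma(x, shape = 3, rate = 1, log = TRUE)
sigma <- 1.5

# Draw from N(center, sigma^2) truncated to (0, Inf) by inverting the CDF.
r_tnorm <- function(center) {
  p_lo <- pnorm(0, center, sigma)
  qnorm(runif(1, p_lo, 1), center, sigma)
}
# Log density of that truncated normal; the normalizing constant depends on center.
log_d_tnorm <- function(to, center) {
  dnorm(to, center, sigma, log = TRUE) - pnorm(center / sigma, log.p = TRUE)
}

n_iter <- 20000
x <- numeric(n_iter)
x[1] <- 1
for (t in 2:n_iter) {
  cur  <- x[t - 1]
  prop <- r_tnorm(cur)
  log_rho <- log_f(prop) - log_f(cur) +
             log_d_tnorm(cur, prop) - log_d_tnorm(prop, cur)  # q(cur | prop) / q(prop | cur)
  x[t] <- if (log(runif(1)) < log_rho) prop else cur
}
mean(x)   # Gamma(3, 1) has mean 3
```

No proposal lands below zero here, at the price of evaluating two extra normal CDFs per iteration.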

The choice of setting any negative proposed value to zero doesn't gain you anything since (I'm guessing) your data will indicate that the parameter cannot actually be zero, and thus $f(x^*)$ will be zero when $x^*=0$. Even if this isn't the case, evaluation of the proposal distribution will be a bit annoying since it is now a mixture of a continuous distribution and a point mass at zero.

There are many other choices for the proposal distribution including independent proposals, i.e. those that do not depend on $x_{old}$. You have not provided enough information for us to give a "better" proposal distribution because this will depend on the target distribution. If this target distribution doesn't have too much mass near zero, then the normal random-walk proposal, i.e. $N(x_{old},\sigma^2)$, will likely work well even though it occasionally will reject a negative proposed value.
