Solved – Hamiltonian/Hybrid MCMC ‘mass matrix’ terminology

bayesian, markov-chain-montecarlo, monte-carlo

I am trying to implement HMC with a non-diagonal mass matrix, but am getting tripped up by some of the terminology.

According to BDA3 and Neal's review, the kinetic energy term (which, I gather, is almost always used for convenience) is

$$ K(p) = \frac{p^T M^{-1} p}{2} \,.$$

Up to a constant, this is recognizable as the negative log density of a multivariate normal with zero mean and covariance matrix $M$. BDA3 (pg 301) says

To keep it simple, we commonly use a diagonal mass matrix, $M$. If so, the components of $\phi$ are independent, with $\phi_j \sim N(0, M_{jj})$ for each dimension $j = 1, \ldots, d$. It can be useful for $M$ to roughly scale with the inverse covariance matrix of the posterior distribution, $(\operatorname{var}(\theta \mid y))^{-1}$.

(I'm reading $N(0, M)$ as a multivariate normal with mean zero and covariance $M$.)
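For concreteness, here is a minimal numpy sketch of the two objects in play, with a made-up $M$ (any symmetric positive-definite matrix will do): the momentum draw $\phi \sim N(0, M)$ and the kinetic energy $K(\phi) = \phi^T M^{-1} \phi/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
# A toy mass matrix: any symmetric positive-definite M works here.
A = rng.standard_normal((d, d))
M = A @ A.T + d * np.eye(d)

# Momentum draw: phi ~ N(0, M), with M read as a covariance matrix.
phi = rng.multivariate_normal(np.zeros(d), M)

# Kinetic energy K(phi) = phi^T M^{-1} phi / 2 (solve rather than invert).
K = 0.5 * phi @ np.linalg.solve(M, phi)
print(K)
```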

The part tripping me up is where it says that "it can be useful for $M$ to roughly scale with the inverse covariance matrix of the posterior distribution…".

And yet, just before that, the momentum sample $\phi$ that starts the leapfrog steps is drawn from a multivariate normal with covariance matrix $M$.

So which is it? To construct a good $M$ for HMC, should I estimate the covariance matrix or the precision matrix of the posterior? Even though $M$ is the covariance matrix of the momentum, will using an $M$ that estimates the posterior precision matrix yield a more efficient algorithm?

Secondary question: what is the intuition that could guide me here?

  • Do you want to use a precision matrix so that the momentum pushes orthogonally to the potential/posterior to improve mixing?

  • Or do you want the momentum to push towards the high-probability-mass part of the posterior (because that's where you want to draw most samples from)?

p.s.
The reason I'm not using the identity matrix for $M$ is that, for my problem, I happen to be able to obtain a decent estimate of the covariance matrix of my fairly high-dimensional (~1000) posterior beforehand.

Best Answer

A linear transformation of the position variables is equivalent to the inverse (transpose) linear transformation of the momentum variables. Ideally, you want to sample from a transformed distribution whose covariance matrix is the identity matrix, and this is exactly what a suitable linear transformation of the positions, or equivalently a suitable mass matrix, achieves.
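As a quick numerical check of that equivalence (a sketch with a made-up $2 \times 2$ $\Sigma$): whitening samples by $L^{-1}$, where $\Sigma = LL^T$ is the Cholesky decomposition, yields identity covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target covariance Sigma (symmetric positive-definite).
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
L = np.linalg.cholesky(Sigma)  # Sigma = L @ L.T

# Draw q with covariance Sigma, then transform to q' = L^{-1} q.
q = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
q_prime = np.linalg.solve(L, q.T).T

print(np.cov(q_prime.T))  # approximately the identity matrix
```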

For details, there is a nice explanation in Neal's "MCMC using Hamiltonian dynamics", Chapter 5 of the Handbook of Markov Chain Monte Carlo, Section 4.1 ("Effect of linear transformations"). The chapter is available online as a preprint (arXiv:1206.1901).

Neal explains:

Suppose we have an estimate, $\Sigma$, of the covariance matrix for $q$, and suppose also that $q$ has at least a roughly Gaussian distribution. How can we use this information to improve the performance of HMC? One way is to transform the variables so that their covariance matrix is close to the identity, by finding the Cholesky decomposition, $\Sigma = LL^T$, with $L$ being lower-triangular, and letting $q^\prime = L^{-1}q$. [$\ldots$]

An equivalent way to make use of the estimated covariance $\Sigma$ is to keep the original $q$ variables, but use the kinetic energy function $K(p) = p^T \Sigma p/2$; i.e., we let the momentum variables have covariance $\Sigma^{-1}$. The equivalence can be seen by transforming this kinetic energy to correspond to a transformation to $q^\prime = L^{-1} q$ (see equation (4.1)), which gives $K(p^\prime) = (p^\prime)^T {M^\prime}^{-1} p^\prime / 2$ with $M^\prime = (L^{-1}(LL^T)(L^{-1})^T)^{-1} = I$.
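In code, the practical upshot looks something like the sketch below (names like `grad_U` are illustrative; `grad_U(q)` is assumed to return $\nabla U(q)$ for the potential energy $U(q) = -\log p(q \mid y)$, and the Metropolis accept/reject step is omitted). With $M = \Sigma^{-1}$, the position update uses $M^{-1}p = \Sigma p$, and the momentum draw $p \sim N(0, \Sigma^{-1})$ never requires forming $\Sigma^{-1}$ explicitly, which matters in ~1000 dimensions.

```python
import numpy as np

def draw_momentum(rng, L):
    """Draw p ~ N(0, Sigma^{-1}) without forming Sigma^{-1}: with
    Sigma = L @ L.T and z ~ N(0, I), p = L^{-T} z has covariance
    L^{-T} L^{-1} = Sigma^{-1}."""
    z = rng.standard_normal(L.shape[0])
    return np.linalg.solve(L.T, z)

def leapfrog(q, p, grad_U, Sigma, eps, n_steps):
    """Leapfrog trajectory with mass matrix M = Sigma^{-1}; the position
    update uses M^{-1} p = Sigma @ p."""
    p = p - 0.5 * eps * grad_U(q)          # initial half step for momentum
    for step in range(n_steps):
        q = q + eps * (Sigma @ p)          # full step for position
        if step < n_steps - 1:
            p = p - eps * grad_U(q)        # full step for momentum
    p = p - 0.5 * eps * grad_U(q)          # final half step for momentum
    return q, -p                           # negate momentum for reversibility
```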

To give some intuition, suppose that the target pdf is cigar-shaped, pointing in a direction that is not axis-aligned. You can either rotate and rescale the space so that the cigar becomes a ball, and then draw momenta from a unit multivariate normal; or, equivalently, you can keep the original space and draw your momenta so that they are aligned with the cigar (e.g., with most of the velocity along the cigar's major axis, so that you can explore it quickly).
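A quick numerical illustration of this intuition (a toy sketch with a made-up, strongly correlated $\Sigma$): if momenta have covariance $\Sigma^{-1}$, the resulting velocities $M^{-1}p = \Sigma p$ have covariance $\Sigma \Sigma^{-1} \Sigma = \Sigma$, i.e., they are aligned with the cigar's major axis.

```python
import numpy as np

rng = np.random.default_rng(2)

# A "cigar": a strongly correlated 2-D Gaussian target (made-up numbers).
Sigma = np.array([[10.0, 9.5],
                  [ 9.5, 10.0]])
L = np.linalg.cholesky(Sigma)

# Momenta with covariance Sigma^{-1}, drawn as p = L^{-T} z.
z = rng.standard_normal((2, 100_000))
p = np.linalg.solve(L.T, z)

# Velocities dq/dt = M^{-1} p = Sigma @ p; their covariance is
# Sigma @ Sigma^{-1} @ Sigma = Sigma, matching the cigar's shape.
v = Sigma @ p
print(np.cov(v))  # approximately Sigma
```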
