Solved – When is blocked Metropolis sampling more efficient?

markov-chain-montecarlo, metropolis-hastings, sampling

Consider the problem of sampling from $p(\mathbf{x}, \mathbf{y})$ using the Metropolis or Metropolis-Hastings (MH) algorithm. I can either propose samples for $p(\mathbf{x}, \mathbf{y})$ directly, or I could do a blocked version of that and alternately propose samples for $p(\mathbf{x} \mid \mathbf{y})$ and $p(\mathbf{y} \mid \mathbf{x})$, which I believe is also called Metropolis-within-Gibbs.
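For concreteness, here is a minimal sketch of one step of each scheme for a generic unnormalized joint density, assuming symmetric (random-walk style) proposals; log_p, propose_x and propose_y are placeholders for the target log-density and the proposal samplers:

import numpy as np

def joint_mh_step(x, y, log_p, propose_x, propose_y):
    # propose both blocks at once and accept/reject them together
    x_new, y_new = propose_x(x), propose_y(y)
    if np.log(np.random.rand()) < log_p(x_new, y_new) - log_p(x, y):
        return x_new, y_new
    return x, y

def metropolis_within_gibbs_step(x, y, log_p, propose_x, propose_y):
    # update x given y, then y given the (possibly updated) x;
    # p(x | y) is proportional to p(x, y), so the joint log-density is all that is needed
    x_new = propose_x(x)
    if np.log(np.random.rand()) < log_p(x_new, y) - log_p(x, y):
        x = x_new
    y_new = propose_y(y)
    if np.log(np.random.rand()) < log_p(x, y_new) - log_p(x, y):
        y = y_new
    return x, y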

I am trying to better understand under which conditions the blocked scheme will be more efficient (in a statistical sense; let's ignore computational considerations for the moment).

For example, in (blocked) Gibbs sampling, where the acceptance probability is always 1, it will always be better to sample directly from $p(\mathbf{x}, \mathbf{y})$: that gives a perfect sample in the very first step, whereas sampling from the conditionals may take several sweeps through the variables before we get a good sample.
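As a toy illustration, assuming a standard bivariate Gaussian target with correlation $\rho$ (the value 0.95 below is arbitrary), the Gibbs conditionals are available in closed form, and the contrast with direct joint draws is easy to see:

import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.95, 5000

# Gibbs sampling: alternate exact draws from p(x | y) and p(y | x)
x, y = 0.0, 0.0
gibbs = np.empty((n, 2))
for i in range(n):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x | y ~ N(rho * y, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y | x ~ N(rho * x, 1 - rho^2)
    gibbs[i] = x, y

# direct joint sampling: every draw is an independent, exact sample
cov = np.array([[1.0, rho], [rho, 1.0]])
joint = rng.multivariate_normal(np.zeros(2), cov, size=n)

When $\rho$ is close to 1 the Gibbs chain shows high autocorrelation, while each row of joint is already an exact, independent sample.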

However, if I have a less-than-perfect proposal distribution and only weak dependencies between $\mathbf{x}$ and $\mathbf{y}$, the situation might be different. Intuitively, if I accept/reject both of them together, then both proposals have to be good for the joint move to be accepted. So it might make sense to be able to accept the proposed $\mathbf{x}$ while rejecting the proposed $\mathbf{y}$.
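To spell the intuition out: in the extreme case of no dependence, $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})p(\mathbf{y})$, and with a symmetric proposal, a joint move to $(\mathbf{x}', \mathbf{y}')$ is accepted with probability

$$\alpha_{\text{joint}} = \min\!\left(1, \frac{p(\mathbf{x}')\,p(\mathbf{y}')}{p(\mathbf{x})\,p(\mathbf{y})}\right),$$

so a single poor component can drag the whole product down, whereas the blocked scheme accepts $\mathbf{x}'$ and $\mathbf{y}'$ separately with probabilities $\min\left(1, p(\mathbf{x}')/p(\mathbf{x})\right)$ and $\min\left(1, p(\mathbf{y}')/p(\mathbf{y})\right)$.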

I did a little experiment in which $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})p(\mathbf{y})$ is Gaussian and the proposal distribution $q(\mathbf{x}, \mathbf{y}) = q(\mathbf{x})q(\mathbf{y})$ is also Gaussian. I then accepted/rejected the two proposals either jointly or independently. The different lines in the figure below correspond to different variances of the proposal distribution. Accepting them independently appears to be more efficient.

I would like to back up my intuition and numerical experiments with some theory. How would you approach proving that one sampling strategy is more efficient than another? Pointers to some references where this is discussed will be appreciated. Thanks!

[Figure: autocorrelation of the two chains (joint vs. independent accept/reject), one line per proposal variance]

And here's the essential part of the code for creating the figure (note that X and Y in the code do not correspond to $\mathbf{x}$ and $\mathbf{y}$ above: X holds the jointly updated chain and Y the componentwise updated one).

### JOINT ACCEPT/REJECT

# imports assumed by the snippets below
from numpy import exp, square, sum
from numpy.random import rand, randn

# dim, num_samples and sigma (the proposal standard deviation) are assumed to be defined;
# the target is a standard Gaussian, and the pre-filled standard-normal entries of X
# double as the random-walk innovations
X = randn(dim, num_samples)

for i in range(1, num_samples):
    # propose a random-walk step for all dimensions at once
    X[:, i] = X[:, i - 1] + sigma * X[:, i]

    # accept/reject the whole vector in a single Metropolis step
    if rand() > exp(0.5 * (sum(square(X[:, i - 1]), 0) - sum(square(X[:, i]), 0))):
        X[:, i] = X[:, i - 1]

### INDEPENDENT ACCEPT/REJECT

# same chain construction as above, but each coordinate is accepted/rejected separately
Y = randn(dim, num_samples)

for i in range(1, num_samples):
    # propose random-walk steps for all dimensions
    Y[:, i] = Y[:, i - 1] + sigma * Y[:, i]

    # accept/reject each dimension independently (Metropolis-within-Gibbs)
    for j in range(dim):
        if rand() > exp(0.5 * (square(Y[j, i - 1]) - square(Y[j, i]))):
            Y[j, i] = Y[j, i - 1]
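The autocorrelation curves in the figure are computed outside this snippet; a minimal sketch, assuming the arrays X and Y from the code above and a simple empirical estimator (max_lag is an arbitrary choice), could look like this:

import numpy as np

def autocorrelation(chain, max_lag=100):
    # empirical autocorrelation of a 1-D chain at lags 0, ..., max_lag - 1
    chain = chain - chain.mean()
    n, var = len(chain), chain.var()
    return np.array([np.mean(chain[:n - k] * chain[k:]) / var for k in range(max_lag)])

# e.g. autocorrelation of the first coordinate of each chain
acf_joint = autocorrelation(X[0])
acf_independent = autocorrelation(Y[0])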

Best Answer

I don't have references on hand, unfortunately, but in general, the more dependence there is between $X$ and $Y$, the more efficient it will be to update them jointly. In your example $X$ and $Y$ are independent; if, say, they were still jointly Gaussian but had a correlation of 0.99, updating them separately would be much worse.

Updating them jointly will also not be very efficient unless your proposal distribution has roughly the same shape as the distribution you're sampling from. This is what motivated the development of adaptive MCMC methods, which tune the proposal distribution over time; the best-known paper on this is Haario et al. (2001), but there has been significant work in the area since then. With a good proposal distribution (either by making it adaptive or just by having a lucky guess), joint updating in the highly correlated case is just as efficient as joint updating in the independent case, while individual updating is still horribly slow.
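To make the comparison concrete, here is a minimal sketch along these lines, assuming a bivariate Gaussian target with correlation 0.99; the "shaped" proposal simply reuses the target covariance as a stand-in for what an adaptive scheme in the spirit of Haario et al. would learn, and the step size, chain length and seed are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.99, 20000
cov = np.array([[1.0, rho], [rho, 1.0]])
inv_cov = np.linalg.inv(cov)
log_p = lambda z: -0.5 * z @ inv_cov @ z   # unnormalized log-density of the target

def run(step_fn):
    # run a chain of length n started at the origin
    z, out = np.zeros(2), np.empty((n, 2))
    for i in range(n):
        z = step_fn(z)
        out[i] = z
    return out

def joint_step(z, step=0.5, shaped=False):
    # propose both coordinates at once; 'shaped' draws the innovation from the target covariance
    noise = rng.multivariate_normal(np.zeros(2), cov) if shaped else rng.standard_normal(2)
    z_new = z + step * noise
    return z_new if np.log(rng.random()) < log_p(z_new) - log_p(z) else z

def componentwise_step(z, step=0.5):
    # Metropolis-within-Gibbs: update one coordinate at a time
    z = z.copy()
    for j in range(2):
        z_new = z.copy()
        z_new[j] += step * rng.standard_normal()
        if np.log(rng.random()) < log_p(z_new) - log_p(z):
            z = z_new
    return z

# lag-1 autocorrelation of the first coordinate as a rough efficiency measure
for name, step_fn in [("joint, spherical proposal", joint_step),
                      ("joint, shaped proposal", lambda z: joint_step(z, shaped=True)),
                      ("componentwise", componentwise_step)]:
    x0 = run(step_fn)[:, 0]
    x0 = x0 - x0.mean()
    print(name, np.mean(x0[:-1] * x0[1:]) / x0.var())

Comparing the printed lag-1 autocorrelations is a simple way to see the effect described above: on the correlated target, the poorly shaped joint proposal and the componentwise updates should mix noticeably worse than the shaped joint proposal.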
