Computational Physics – Meaning of ‘Thermalization’ in Markov Chain Monte Carlo Simulations

Tags: computational-physics, ising-model, simulations

In performing MCMC simulations, it is standard practice to 'equilibrate' or 'thermalize' the system and then discard the initial data before useful sampling is done.
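
For concreteness, the practice in question looks roughly like the following sketch (a minimal Metropolis single-spin-flip simulation; lattice size, temperature, and sweep counts are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
L, beta = 16, 0.4            # lattice size and inverse temperature (illustrative)
n_burn, n_meas = 1000, 5000  # sweeps to discard / to measure (illustrative)

spins = rng.choice([-1, 1], size=(L, L))  # arbitrary initial configuration

def sweep(spins):
    """One Metropolis sweep: L*L attempted single-spin flips."""
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        # Energy change for flipping spin (i, j); periodic boundaries, J = 1
        nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2 * spins[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1

for _ in range(n_burn):   # 'thermalization': these sweeps are simply discarded
    sweep(spins)

mags = []
for _ in range(n_meas):   # sampling phase
    sweep(spins)
    mags.append(spins.mean())
```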

My question is about the concept of thermalization itself, not its implementation. In any such MCMC simulation, discarding the first few states amounts to choosing a new initial condition for the Markov Chain. Loosely, what are the features that we expect this initial condition to possess?

I have some ideas:

  1. It should not be an unlikely configuration. Starting from an unlikely configuration could heavily bias the first several measurements, especially in local-update algorithms.
  2. The update algorithm should be able to move out of the state easily. What I mean is that the algorithm shouldn't stay stuck in this state for a long time; that would also bias the measurements.

So it looks like being highly likely/highly unlikely is bad. In the example of the Ising model, this means neither the 'chessboard' configuration nor the all-aligned configuration is a good starting point. I surmise that states having energy closer to the average energy would be better for the job.
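
For a concrete sense of how extreme these two configurations are, a quick check of their energies per spin (nearest-neighbour model with $J = 1$ and periodic boundaries assumed):

```python
import numpy as np

def energy_per_spin(spins):
    """Nearest-neighbour Ising energy per spin, J = 1, periodic boundaries."""
    # Each bond is counted exactly once via the 'up' and 'left' neighbours
    bonds = spins * (np.roll(spins, 1, axis=0) + np.roll(spins, 1, axis=1))
    return -bonds.sum() / spins.size

L = 16
i, j = np.indices((L, L))
aligned = np.ones((L, L), dtype=int)
chessboard = 1 - 2 * ((i + j) % 2)

print(energy_per_spin(aligned))     # -2.0: the ground state
print(energy_per_spin(chessboard))  # +2.0: the top of the spectrum
```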

Am I on the right track here? If the 'average energy' conclusion holds, I don't understand how the thermalization procedure is supposed to return such a state with high likelihood (unless, perhaps, the thermal variance of the energy is small).

I couldn't find any references discussing this (at least not worded this way). Sorry if I've missed something obvious.


Edit: The bulk of my confusion came from thinking of each step of the Markov chain as a single configuration rather than a random variable, a distinction pointed out by Yvan Velenik in his answer below.

Let $X_t$ be the random variable corresponding to step $t$ of a chain. Starting a chain from a fixed configuration $S_0$ is not the same as equilibrating a chain for $\tau$ steps and discarding the first $\tau$ (even if one happens to obtain $S_{\tau} = S_0$). The latter chain is expected to produce samples according to the stationary distribution $D$ because $X_{\tau} \sim D$, but the former isn't, because $X_0 \sim \delta_{S_0}$, where $\delta_{S_0}$ is the distribution that assigns probability 1 to $S_0$. This is what I'd missed.
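
A toy example makes the distinction visible (the two-state chain below is made up purely for illustration). An ensemble of chains started from a fixed $S_0$ is still biased after a few steps, while an ensemble whose initial states are drawn from the stationary distribution is correctly distributed at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.8, 0.2],      # transition matrix of a toy two-state chain
              [0.3, 0.7]])
pi = np.array([0.6, 0.4])      # its stationary distribution (pi P = pi)

def step(states):
    """Advance an array of independent chains by one step."""
    return np.where(rng.random(states.size) < P[states, 0], 0, 1)

n_chains, t = 100_000, 3
fixed = np.zeros(n_chains, dtype=int)                    # X_0 ~ delta(S_0), S_0 = 0
stationary = (rng.random(n_chains) < pi[1]).astype(int)  # X_0 ~ pi

for _ in range(t):
    fixed, stationary = step(fixed), step(stationary)

print((fixed == 0).mean())       # ~0.65: still biased toward S_0 at t = 3
print((stationary == 0).mean())  # ~0.60: stationary at every t
```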

A helpful reference is Alan Sokal's notes on Monte Carlo methods, where he also makes the useful distinction between the exponential and integrated autocorrelation times.
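
For reference, with $\rho(t)$ denoting the normalized autocorrelation function of the measured observable, the two times Sokal distinguishes are

$$\tau_{\exp} = \limsup_{t \to \infty} \frac{t}{-\log |\rho(t)|}, \qquad \tau_{\mathrm{int}} = \frac{1}{2} + \sum_{t=1}^{\infty} \rho(t).$$

Roughly, $\tau_{\exp}$ governs relaxation from an out-of-equilibrium start (how much to discard), while $\tau_{\mathrm{int}}$ governs the statistical error of equilibrium averages (how far apart effectively independent measurements are).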

Best Answer

Discarding the first few states does not amount to "choosing a new initial condition". One discards them because the goal of the simulation is to sample configurations from the Gibbs measure.

The Markov chain is constructed in such a way that its unique stationary distribution is given by the Gibbs measure. The ergodic theorem for Markov chains then guarantees that if one lets the chain run for infinitely many steps, then it will be found in a given configuration with a probability given by the Gibbs measure.
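
Concretely, the standard way of arranging this is to impose detailed balance with respect to the Gibbs measure $\pi(x) \propto e^{-\beta E(x)}$:

$$\pi(x)\, P(x \to y) = \pi(y)\, P(y \to x) \quad \text{for all } x, y.$$

For a symmetric proposal, the Metropolis rule achieves this by accepting a proposed move $x \to y$ with probability $\min\bigl(1, e^{-\beta [E(y) - E(x)]}\bigr)$; detailed balance is sufficient (though not necessary) for $\pi$ to be stationary.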

Of course, one cannot let the chain run for an infinite time, so one stops it after a finite number of steps and one hopes that one has waited long enough for the chain to be distributed approximately according to its stationary distribution.

A second, closely related, reason for discarding steps is that, even if the chain has reached equilibrium, you have to wait long enough between measurements if you want them to be approximately independent. More precisely, if you wish to compute the expected value of some observable under the Gibbs measure by averaging along a trajectory of the Markov chain, then you need a long piece of the trajectory, because the chain has to explore a large part of the state space (this is less of a problem for macroscopic observables, since the latter usually take nearly constant values over most of the state space, as measured by the Gibbs probability measure). Measuring the observable over successive steps will give you highly correlated values, not much better than what you get from a single measurement. The ergodic theorem again guarantees that the average over an infinite trajectory of the Markov chain will almost surely coincide with the expectation under the Gibbs measure. But you only have access to a finite piece of the trajectory, so it has to be long enough for the approximation to be reasonable.
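
One common way of quantifying "long enough" is to estimate the integrated autocorrelation time of the measured series and quote an effective number of independent samples. A rough sketch (the self-consistent window with $c \approx 6$ is a standard heuristic, not the only possible choice):

```python
import numpy as np

def tau_int(series, c=6.0):
    """Integrated autocorrelation time with an automatic windowing cutoff:
    accumulate rho(t) until the window t exceeds c times the running estimate."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    n = x.size
    rho = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)  # rho(0) = 1
    tau = 0.5
    for t in range(1, n):
        tau += rho[t]
        if t >= c * tau:   # self-consistent window (Sokal-style heuristic)
            break
    return tau

# Usage, with `measurements` a series recorded after thermalization:
# tau = tau_int(measurements)
# n_eff = len(measurements) / (2 * tau)   # effective independent sample count
```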

That being said, you can indeed accelerate convergence (to a good approximation of the equilibrium measure) by starting from a configuration that is not too far from equilibrium. For instance, if you consider the Ising model with a small positive magnetic field, but start with a configuration consisting entirely of $-$ spins, then the time to relax will be much longer than if you start with a configuration consisting of only $+$ spins. Indeed, the former will first relax to the metastable $-$ phase, and you will have to wait for the creation of a supercritical droplet in order to reach true equilibrium.

In particular, it is true that if you start with a configuration sampled from the Gibbs measure, then your chain is immediately distributed according to the stationary distribution (that's precisely why it is called a stationary distribution). In that case, you don't have to wait for thermalization (you are already distributed correctly), but you still have to wait between successive measurements for the reasons explained above. Of course, the whole point is that you would already need a way of sampling the initial configuration from the Gibbs measure, so such an approach is practically useless (and is replaced by the initial run of the Markov chain).

Let me mention, to conclude, that there exists a special type of implementation of MCMC algorithms, the so-called perfect simulation algorithms, that run for a random but finite amount of time and are guaranteed to output a configuration distributed exactly according to the Gibbs measure. But such algorithms are not always available in a form that is computationally efficient. The best known such algorithm is coupling from the past.
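
For the curious, here is a compact sketch of monotone coupling from the past for the 2D Ising model, assuming heat-bath (Glauber) updates; the lattice size and temperature are illustrative and the code is not optimized:

```python
import numpy as np

L, beta = 8, 0.35   # illustrative; coalescence is fast at high temperature
rng = np.random.default_rng(2)
randomness = {}     # uniforms for each time step, drawn once and then reused

def rand_at(t):
    """Return the (fixed) uniforms driving the sweep at time t."""
    if t not in randomness:
        randomness[t] = rng.random((L, L))
    return randomness[t]

def sweep(spins, u):
    """Heat-bath sweep in fixed site order, driven by the shared uniforms u."""
    for i in range(L):
        for j in range(L):
            h = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                 + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * h))
            spins[i, j] = 1 if u[i, j] < p_plus else -1
    return spins

T = 1
while True:
    top = np.ones((L, L), dtype=int)    # maximal configuration (all +)
    bot = -np.ones((L, L), dtype=int)   # minimal configuration (all -)
    for t in range(-T, 0):              # run from time -T up to time 0
        u = rand_at(t)
        top, bot = sweep(top, u), sweep(bot, u)
    if np.array_equal(top, bot):        # coalescence: an exact Gibbs sample
        break
    T *= 2                              # restart further in the past

print(top.mean())   # magnetization of the exactly distributed sample
```

The two essential ingredients are that the same uniforms are reused whenever the algorithm restarts further in the past, and that the heat-bath update preserves the partial order between configurations, so the all-$+$ and all-$-$ chains sandwich every other trajectory and their coalescence at time $0$ certifies the output.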
