[Math] Intuitive explanation of the spectral gap in context of Markov Chain Monte Carlo (MCMC)

eigenvalues-eigenvectors, markov-chains, mixing, monte-carlo

I'm learning about Markov chain Monte Carlo (MCMC) methods and mixing times, and could use some help understanding the concept of the spectral gap and why/how it relates to mixing times.

Thus far what I have learned is this: given an ergodic (irreducible, aperiodic) and reversible (i.e. $\pi(x)P(x,y) = \pi(y)P(y,x)$ for all $x,y \in \Omega =$ state space) finite Markov chain with transition matrix $P$, the spectral gap is defined as:

$$\xi = 1 - \lambda_{\max}$$

where $\lambda_{\max} = \max\{|\lambda_2|, |\lambda_n|\}$ and the eigenvalues are ordered $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_n > -1$. Alternatively, the spectral gap can be understood as the smallest positive nonzero eigenvalue (I think?)

I vaguely get that this is somehow related to the mixing time of the Markov chain (the bigger the spectral gap, the faster the mixing), but I lack an intuitive, fundamental understanding of why this is the case. My linear algebra knowledge is sketchy at best, so I'd very much appreciate an explain-like-I'm-five account of why the eigenvalues are bounded the way they are (between $-1$ and $1$), why the negative eigenvalues don't really concern us for purposes of MCMC, why we care about $\lambda_{\max}$, and how it is that the spectral gap is indicative of the mixing time.

Best Answer

Strictly speaking, MCMC means constructing a Markov chain which has some desired property (usually either a prespecified equilibrium distribution or a prespecified expectation) and then numerically drawing samples from it. If you are already given a Markov chain, this is really just called Monte Carlo simulation.

Anyway, at each fixed time $t$, the state of the chain has some distribution, given by $p_0 P^t$, where $p_0$ is the initial distribution (written as a row vector) and $P$ is the transition probability matrix (in the row-stochastic convention, i.e. $\sum_{j=1}^n P_{ij}=1$).
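To make this concrete, here is a minimal sketch with a hypothetical two-state chain (the matrix and numbers below are my own illustration, not from the answer), computing $p_0 P^t$ directly:

```python
import numpy as np

# A small row-stochastic transition matrix (hypothetical example).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

p0 = np.array([1.0, 0.0])  # start in state 0 with probability 1

# Distribution at time t is p0 P^t (row vector times matrix power).
t = 50
pt = p0 @ np.linalg.matrix_power(P, t)
print(pt)  # close to the stationary distribution (2/3, 1/3)
```

For this $P$ the stationary distribution is $(2/3, 1/3)$, and by $t=50$ the chain has all but forgotten that it started in state 0.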

So the long term behavior of the distributions at each fixed $t$ is determined by the behavior of $P^t$ as $t \to \infty$. Any stochastic matrix has all eigenvalues of modulus at most $1$. It also has $1$ as an eigenvalue. The ergodicity assumption additionally tells us that the eigenvalue $1$ has multiplicity $1$ and that all of its other eigenvalues will have modulus strictly less than $1$. (Without irreducibility it is possible that the eigenvalue $1$ has a higher multiplicity; without aperiodicity it is possible to have an eigenvalue like $-1$ or $e^{2\pi i/3}$.) The reversibility assumption tells you (among other things) that $P$ is diagonalizable with real eigenvalues (because reversibility furnishes a similarity transformation which turns $P$ into a symmetric matrix).
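These spectral facts are easy to check numerically. Below is a sketch with a hypothetical three-state birth-death chain (which is automatically reversible); the similarity transformation mentioned above is $D^{1/2} P D^{-1/2}$ with $D = \mathrm{diag}(\pi)$:

```python
import numpy as np

# A reversible (birth-death) chain on three states; hypothetical example.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

eigvals = np.linalg.eigvals(P)
print(np.sort(eigvals.real))  # all real; one eigenvalue equal to 1

# Every stochastic matrix has all eigenvalues of modulus at most 1,
# and 1 is always an eigenvalue (right eigenvector of all ones).
assert np.all(np.abs(eigvals) <= 1 + 1e-12)
assert np.isclose(np.max(eigvals.real), 1.0)

# Reversibility makes D^{1/2} P D^{-1/2} symmetric, where D = diag(pi).
pi = np.array([0.25, 0.50, 0.25])  # stationary distribution of this chain
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
assert np.allclose(S, S.T)  # symmetric, hence real eigenvalues
```

Since $S$ is symmetric it has real eigenvalues, and similar matrices share eigenvalues, so $P$'s eigenvalues are real too.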

As a result of these three facts you can write

$$p_0 P^t = \pi + \sum_{i=2}^n c_i \lambda_i^t q_i$$

where the $\lambda_i$ are the eigenvalues other than $1$, all less than $1$ in absolute value, the $q_i$ are the corresponding left eigenvectors, and the $c_i$ are coefficients depending on $p_0$. (The leading term is exactly $\pi$ because the non-stationary left eigenvectors have components summing to zero, so all of $p_0$'s total mass of $1$ sits on the stationary eigenvector.) This is just what you get from diagonalizing $P^t$ (without going through and computing the $c_i$, since they do not really matter for this presentation).

You can see that as $t \to \infty$, the remaining terms decay, because $|\lambda_i| < 1$ forces $\lambda_i^t \to 0$. Assuming the $c_i$ are all significant (the typical situation for a "random" initial distribution), the slowest decay is governed by $\lambda_{\max}$, since it is the closest to $1$ in absolute value. So the larger the spectral gap $\xi$, the faster the convergence. Note that the negative eigenvalues do matter, which is why you wrote it as $\lambda_{\max}=\max \{ |\lambda_2|,|\lambda_n| \}$.
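You can watch this geometric decay directly. In the hypothetical two-state chain below (my own numbers), $p_0 - \pi$ happens to lie exactly along the second eigenvector, so the distance to $\pi$ shrinks by precisely a factor of $\lambda_{\max} = 0.7$ per step:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])  # stationary distribution of this P
lam_max = 0.7              # second eigenvalue (trace minus 1 for a 2x2)

p = np.array([1.0, 0.0])
for t in range(1, 6):
    p = p @ P
    # The L1 distance to pi decays by a factor lam_max each step.
    print(t, np.abs(p - pi).sum(), (2/3) * lam_max**t)
```

The two printed columns match at every step, which is the spectral-gap story in miniature: the gap $1 - 0.7 = 0.3$ sets the convergence rate.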

The precise relationship between the spectral gap and the dynamical properties of the chain is more subtle than the rest of what I've said here. Loosely speaking when the spectral gap is small, the eigenvector corresponding to $\lambda_{\max}$, which I might call a "metastable eigenvector", will be significant and positive for one set of states and significant and negative for another set of states. It might also be smaller on a third set of states.

The "metastable eigenvector" then corresponds to the slow exchange of probability between the first two sets of states (because the decay of its contribution to the distribution results in a flow from one set to the other, with which set is which depending on the sign of the corresponding $c_i$). You can look up metastability in Markov chains for more about this. To get a feel for it, consider a matrix like

$$P=\begin{bmatrix} 1/3 & 2/3 & 0 & 0 \\ 1/4-\epsilon/2 & 3/4-\epsilon/2 & \epsilon & 0 \\ 0 & \epsilon & 1/5-\epsilon/2 & 4/5-\epsilon/2 \\ 0 & 0 & 1/6 & 5/6 \end{bmatrix}$$

where $0<\epsilon \ll 1$ is a small parameter. For $\epsilon=10^{-6}$ for example, this has relatively rapid equilibration within the sets $\{ 1,2 \}$ and $\{ 3,4 \}$ (driven by eigenvalues of about $1/12$ and $1/30$ respectively) and slow equilibration between the two of them (driven by an eigenvalue of about $1-10^{-6}$).
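A quick numerical check of these claims (sketch; the eigenvalue values in the comments are approximate):

```python
import numpy as np

eps = 1e-6
P = np.array([
    [1/3,         2/3,         0,           0          ],
    [1/4 - eps/2, 3/4 - eps/2, eps,         0          ],
    [0,           eps,         1/5 - eps/2, 4/5 - eps/2],
    [0,           0,           1/6,         5/6        ],
])

lam = np.sort(np.linalg.eigvals(P).real)[::-1]
# Approximately: 1, 1 - O(eps) (the metastable eigenvalue),
# then roughly 1/12 and 1/30 (fast equilibration within each block).
print(lam)
```

The near-$1$ eigenvalue is the one whose "metastable eigenvector" carries the slow exchange of probability between $\{1,2\}$ and $\{3,4\}$; shrinking $\epsilon$ pushes it closer to $1$ and makes the mixing correspondingly slower.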
