How does Markov chain coupling work?

coupling, markov-chains, probability

I'm trying to understand the coupling argument used in the proof of Theorem 5.2 in Finite Markov Chains and Algorithmic Applications by Häggström. My problem is actually more general than this particular situation, hence the generic question title.

Before asking the question, I will build up some context.

Theorem 5.2

For any aperiodic and irreducible Markov chain $(X_0, X_1, \dots)$ with state space $S = \{s_1, s_2, \dots, s_k\}$, any stationary distribution $\pi$, and any initial distribution $\mu^{(0)}$, we have

$$\lim_{n\rightarrow\infty}{d_{TV}(\mu^{(n)}, \pi) = 0}$$
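(Here $d_{TV}$ denotes total variation distance; for distributions $\mu$ and $\pi$ on $S$, the book defines it as $d_{TV}(\mu, \pi) = \frac{1}{2}\sum_{i=1}^{k} |\mu_i - \pi_i|$.)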

To prove this, we first define two Markov chains: $X = (X_0, X_1, \dots)$ with initial distribution $\mu^{(0)}$ and $X' = (X'_0, X'_1, \dots)$ with initial distribution $\pi$. Since $\pi$ is stationary, $X'_n$ has distribution $\pi$ for every $n$.

We also define $T = \min\{n : X_n = X'_n\}$ and prove that $T$ is finite with probability one, but that part is fine for me, so I'll skip it.

Then the problem begins. We define a third Markov chain $X''$ by

$$X''_n = \begin{cases} X_n & \text{for } n < T, \\ X'_n & \text{for } n \ge T. \end{cases}$$


Question 1:
Why is $X''$ a legitimate Markov chain? On one hand I understand that it satisfies the memorylessness property, but in a Markov chain the distribution of the next step is $\mu^{(n+1)} = \mu^{(n)}P$, where $P$ is the transition matrix. Doesn't this condition break at time $n = T$?

Question 2:
It's said that since $X''_0$ has distribution $\mu^{(0)}$, $X''_n$ has distribution $\mu^{(n)}$ for every $n$. I don't see it. I understand why it works for $n \le T$, but I don't get why it should hold after that time. I mean, we reach $n = T$ with distribution $\mu^{(T)}$, and from then on we follow $X'$, whose distribution is $\pi$. It just doesn't seem like we get $\mu^{(T+1)}$.


In the proof we then proceed to show that $d_{TV}(\mu^{(n)}, \pi) \le \Pr(T > n)$, which is fine if you trust the claims I asked about in the questions above. This and one more thing actually complete the proof.

But I still have one more question:

Why does it prove anything? We artificially crafted a Markov chain that has a certain property, and somehow this proves something about the distributions in general? Why is that?

Best Answer

Here is a more "algorithmic" way of describing the coupling.

Define two Markov chains $Y$ and $Y'$ by initializing $Y_0$ from distribution $\mu^{(0)}$ and $Y_0'$ from distribution $\pi$. Then choose the next states by the following algorithm (a small simulation sketch follows the two rules):

  1. If $Y_n \ne Y_n'$, then choose $Y_{n+1}$ and $Y_{n+1}'$ independently, following the transition rules of the Markov chain.
  2. If $Y_n = Y_n'$, then choose a single value following the transition rules of the Markov chain, and set both $Y_{n+1}$ and $Y_{n+1}'$ equal to that value.
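Here is a minimal numpy sketch of these two rules (the 3-state transition matrix is a made-up example, not anything from the book). It simulates the coupled pair many times and checks empirically that, marginally, $Y_n$ has distribution $\mu^{(0)}P^n$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-state chain: all entries positive, so aperiodic and irreducible.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
k = P.shape[0]
mu0 = np.array([1.0, 0.0, 0.0])   # an arbitrary initial distribution for Y

# Stationary distribution pi: left eigenvector of P for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

def coupled_step(y, yp):
    """One coupling step: rule 1 while the chains differ, rule 2 once they meet."""
    if y != yp:
        return rng.choice(k, p=P[y]), rng.choice(k, p=P[yp])  # independent moves
    z = rng.choice(k, p=P[y])                                 # one shared move
    return z, z

n, trials = 5, 50_000
counts = np.zeros(k)
for _ in range(trials):
    y, yp = rng.choice(k, p=mu0), rng.choice(k, p=pi)
    for _ in range(n):
        y, yp = coupled_step(y, yp)
    counts[y] += 1                 # record only Y_n, ignoring Y_n'

print("empirical law of Y_n:", counts / trials)
print("mu^(n) = mu0 P^n    :", mu0 @ np.linalg.matrix_power(P, n))
```

The two printed vectors should agree up to Monte Carlo noise, which is exactly the claim in Question 2.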

Then it's clear that if we just look at $Y_n$ and ignore $Y_n'$ entirely, we get a Markov chain, because at each step we follow the transition rules. Similarly, we get a Markov chain if we look at $Y_n'$ and ignore $Y_n$. But we do not get independent Markov chains.

The $X_n, X_n', X_n''$ construction that "crosses over" at time $T$ is just a non-algorithmic way of building this coupling. In the end, $(X'', X')$ has the same joint distribution as the pair $(Y, Y')$ defined above. In terms of $Y$ and $Y'$, we can set $T$ to be the first $n$ for which $Y_n = Y_n'$.

If we work with $Y$ and $Y'$ then it's easy to answer your questions:

  1. $Y$ and $Y'$ are both legitimate Markov chains because, individually, each of them follows the transition rules.
  2. We can prove that $Y_n$ has distribution $\mu^{(n)} = \mu^{(0)}P^n$ for all $n$ by induction; if we ignore the existence of $Y'$, there is literally nothing unusual going on (the one-step computation is spelled out below). Similarly, $Y_n'$ has distribution $\pi$ for all $n$ by induction.
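Spelled out, the induction step for $Y$ is the usual one-step computation:

$$\Pr[Y_{n+1} = s_j] = \sum_{i=1}^{k} \Pr[Y_n = s_i]\, P_{ij} = \big(\mu^{(n)} P\big)_j = \mu^{(n+1)}_j,$$

and the same computation for $Y'$, combined with $\pi P = \pi$, shows that $Y_{n+1}'$ again has distribution $\pi$. Note that this only uses the fact that each chain individually follows $P$; whether the two moves were independent (rule 1) or shared (rule 2) never enters.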

Why does this prove anything? It proves something about the distributions $\mu^{(n)}$ and $\pi$ because we constructed $Y$ and $Y'$ so that $Y_n$ has distribution $\mu^{(n)}$ while $Y_n'$ has distribution $\pi$.

For every state $s_i$, the events $\{Y_n = s_i\}$ and $\{Y_n' = s_i\}$ can only disagree when $Y_n \ne Y_n'$, so $|\mu^{(n)}_i - \pi_i| = |\Pr[Y_n = s_i] - \Pr[Y_n' = s_i]| \le \Pr[Y_n \ne Y_n']$. We know that $\Pr[Y_n \ne Y_n'] = \Pr[T > n] \to 0$ as $n \to \infty$, so $|\mu^{(n)}_i - \pi_i| \to 0$ for every $i$, and hence $d_{TV}(\mu^{(n)}, \pi) \to 0$.
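To see the inequality numerically, here is a self-contained sketch (same made-up 3-state chain as above) that estimates $\Pr[T > n]$ by simulating the coupling and compares it with the exact $d_{TV}(\mu^{(n)}, \pi)$ computed from matrix powers; the estimate of $\Pr[T > n]$ should dominate $d_{TV}$ at every $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical 3-state chain as in the earlier sketch.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
k = P.shape[0]
mu0 = np.array([1.0, 0.0, 0.0])

vals, vecs = np.linalg.eig(P.T)                 # stationary distribution pi
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

def meeting_time(n_max):
    """Return T = first n with Y_n = Y_n' (capped at n_max).
    After meeting, the chains move together, so {Y_n != Y_n'} = {T > n}."""
    y, yp = rng.choice(k, p=mu0), rng.choice(k, p=pi)
    for n in range(n_max):
        if y == yp:
            return n
        y, yp = rng.choice(k, p=P[y]), rng.choice(k, p=P[yp])  # rule 1
    return n_max

T = np.array([meeting_time(50) for _ in range(20_000)])

mu = mu0.copy()
for n in range(8):
    tv = 0.5 * np.abs(mu - pi).sum()            # exact d_TV(mu^(n), pi)
    print(f"n={n}:  d_TV = {tv:.4f}  <=  Pr[T>n] ~ {(T > n).mean():.4f}")
    mu = mu @ P                                 # mu^(n+1) = mu^(n) P
```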