Probability – How Long Does It Take to Reach a Limiting Distribution


For Markov Chains, there are well-known formulae (e.g. the fundamental matrix approach, first-step analysis) which show us how to calculate the Expected Time to Absorption (see e.g. https://en.wikipedia.org/wiki/Absorbing_Markov_chain). That is, given a transition matrix and the current state, they tell us how many iterations (i.e. steps) are required on average for the chain to reach a certain (absorbing) state.
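To make the absorption formulae concrete, here is a minimal NumPy sketch with a made-up three-state absorbing chain (the matrix below is purely illustrative, not part of the question):

```python
import numpy as np

# Made-up chain: states 0 and 1 are transient, state 2 is absorbing.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])

Q = P[:2, :2]                        # transitions among the transient states
N = np.linalg.inv(np.eye(2) - Q)     # fundamental matrix N = (I - Q)^{-1}
t = N @ np.ones(2)                   # expected number of steps to absorption
print(t)                             # one entry per transient starting state
```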

In a Markov Chain, we can define the following (a short numerical sketch is given after these definitions):

  1. Define the Limiting Distribution as the limit of the $n$-step distribution $\pi(n)$:

$$ \lim_{n \to \infty} \pi(n) = \lim_{n \to \infty} \pi(0) P^n $$

Where:

  • $\pi(0)$ is the initial distribution.
  • $P$ is the transition probability matrix of the Markov chain.
  2. Define the Stationary Distribution $\pi$ as:

$$\pi P^n = \pi$$

Where:

  • $P$ is the transition probability matrix of the Markov chain.
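For concreteness, here is a minimal NumPy sketch of both definitions on a made-up two-state chain (the matrix and distributions are assumptions chosen only for illustration):

```python
import numpy as np

# Toy transition matrix and initial distribution, chosen only for illustration.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi0 = np.array([1.0, 0.0])     # initial distribution pi(0)

# Limiting distribution: pi(0) P^n for large n.
print(pi0 @ np.linalg.matrix_power(P, 50))    # approaches [0.8, 0.2]

# Stationary distribution: a probability vector pi with pi P = pi (hence pi P^n = pi).
pi = np.array([0.8, 0.2])
print(pi @ P)                                 # [0.8, 0.2] again
```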

Based on these concepts, I have the following question:

Just as we have formulae to calculate the Expected Time to Absorption – given a Markov Chain with a given Transition Matrix and initial Distribution, can we derive analytical formulae to calculate the Expected Time to reach the Limiting Distribution and the Expected Time to reach the Stationary Distribution (provided these distributions exist)? How can this be done?

  • Note 1: My idea for doing this would be to first determine the Limiting and Stationary Distributions (if they exist) by solving the respective equations. Then, I would use the absorbing-chain formulae to determine the number of steps needed to reach these distributions. Is this correct?

  • Note 2: The Limiting Distribution is more "powerful" than the Stationary Distribution. If a Limiting Distribution exists, then it is also a Stationary Distribution. However, if a Stationary Distribution exists, the same cannot be said about the Limiting Distribution.

  • Note 3 (Stationary Distributions): I know that Stationary Distributions are generally written as $\pi P = \pi$, but I prefer writing $\pi P^n = \pi$ as it is more logical to me. If you happen to "stumble into" a Stationary Distribution – you will be there forever (i.e. still be there even after $n$ steps).

I believe that both formulations are equivalent – see below:

$$\pi P = \pi$$
$$\pi P^2 = \pi P = \pi$$
$$ \pi P^3 = \pi P^2 P = \pi P^2 = \pi$$
$$\pi P^n = \pi$$
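A quick numerical sanity check of this equivalence, again on an assumed toy chain (the numbers are illustrative only):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])      # satisfies pi P = pi for this toy chain

print(np.allclose(pi @ P, pi))                               # True
print(np.allclose(pi @ np.linalg.matrix_power(P, 10), pi))   # True: pi P^n = pi
```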

Best Answer

First, any existing limiting distribution must be stationary; see for instance [1] for a more detailed discussion. Conversely, a Markov chain may have a stationary distribution but no limiting distribution. Hence, let us consider convergence towards the stationary distribution.

Second, a Markov chain has a unique stationary distribution $\pi$ if and only if it is irreducible and all of its states are positive recurrent. (Any finite-state irreducible chain is positive recurrent.) Assume that this holds, so that the stationary distribution $\pi$ exists and is unique. By definition, $\pi$ is stationary if it is unaffected by the action of the transition matrix $P \in \mathbb{R}^{n \times n}$, that is: $$ P^{T} \pi = 1 \cdot \pi. $$

This implies that $\pi$ is an eigenvector of $P^{T}$ (equivalently, a left eigenvector of $P$) with eigenvalue 1. Since $\pi$ is unique, the eigenvalue 1 is simple. If the chain is additionally aperiodic, it can be shown that all other eigenvalues have modulus strictly less than 1: $$ \lambda_{1} = 1 > |\lambda_{2}| \geq \dots \geq |\lambda_{n}|. $$
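As a minimal illustration of this characterization (in NumPy, on a made-up irreducible, aperiodic two-state chain; the matrix is an assumption for this sketch), the stationary distribution can be read off as the eigenvector of $P^{T}$ for eigenvalue 1:

```python
import numpy as np

# Small illustrative chain; the entries are made up for this sketch.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

eigvals, eigvecs = np.linalg.eig(P.T)          # eigenvectors of P^T = left eigenvectors of P
k = np.argmin(np.abs(eigvals - 1.0))           # locate the eigenvalue 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                             # normalize to a probability vector
print(pi)          # [0.8, 0.2] for this P
print(pi @ P)      # equals pi, i.e. the distribution is stationary
```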

See for example Chapter 2 in [2] for an introduction to discrete-time Markov chains. In the following,

  • $n$ stands for the size of the state space;
  • $t = 0, 1, \dots$ denotes time, i.e. the number of steps;
  • $p_{i}(t)$ is the probability of being in state $i = 1, \dots, n$ at time $t$;
  • $p(t) \in \mathbb{R}^{n}$ is the entire distribution at time $t$.

Now, assuming that the transition matrix $P$ is diagonalizable, it can be decomposed as: $$ P = \begin{pmatrix} u_{1} & \dots & u_{n} \end{pmatrix} \begin{pmatrix} \lambda_{1} & & 0 \\ & \ddots & \\ 0 & & \lambda_{n} \end{pmatrix} \begin{pmatrix} v_{1}^{T}\\ \vdots\\ v_{n}^{T} \end{pmatrix}, $$ where the columns $u_{i}$ are right eigenvectors of $P$, the rows $v_{i}^{T}$ are the corresponding left eigenvectors, and $v_{i}^{T} u_{j} = \delta_{ij}$.

We can then write: \begin{align*} p(t)^{T} &= p(0)^{T} P^{t} \\ &= p(0)^{T} \sum_{i = 1}^{n} \lambda_{i}^{t} u_{i} v_{i}^{T} \\ &= p(0)^{T} \lambda_{1}^{t} u_{1} v_{1}^{T} + p(0)^{T} \sum_{i=2}^n \lambda_{i}^{t} u_{i} v_{i}^{T} \\ &= p(0)^{T} \mathbf{1} \pi^{T} + p(0)^{T} \sum_{i=2}^n \lambda_{i}^{t} u_{i} v_{i}^{T} \\ &= \pi^{T} + p(0)^{T} \sum_{i=2}^n \lambda_{i}^{t} u_{i} v_{i}^{T} \end{align*}

In the fourth line we used $\lambda_{1} = 1$, $u_{1} = \mathbf{1}$ (the all-ones vector, since the rows of $P$ sum to one) and $v_{1} = \pi$, and in the last line $p(0)^{T} \mathbf{1} = 1$ (probabilities sum to one). Therefore, $\lim_{t \rightarrow \infty} p(t)^{T} = \pi^{T}$, since $|\lambda_{i}| < 1$ for all $i > 1$ and all remaining terms vanish in the limit.
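Here is a small numerical check of this decomposition, again on an assumed two-state chain (the matrix is made up for illustration):

```python
import numpy as np

# Illustrative 2-state chain; the numbers are assumptions for this sketch.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

lam, U = np.linalg.eig(P)              # columns of U are right eigenvectors u_i
order = np.argsort(-np.abs(lam))       # sort so that lambda_1 = 1 comes first
lam, U = lam[order], U[:, order]
V = np.linalg.inv(U)                   # rows of V are the matching left eigenvectors v_i^T

t = 10
# P^t rebuilt from sum_i lambda_i^t u_i v_i^T matches the direct matrix power.
Pt = sum(lam[i] ** t * np.outer(U[:, i], V[i, :]) for i in range(len(lam)))
print(np.allclose(Pt, np.linalg.matrix_power(P, t)))   # True

# The leading term alone is 1^t * u_1 v_1^T, whose rows are all equal to pi.
print(lam[0] ** t * np.outer(U[:, 0], V[0, :]))        # each row ~ [0.8, 0.2]
```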

Now consider the rate of convergence towards the stationary distribution, measured by the error: \begin{align*} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} E(t) &= \norm{\pi^{T} - p(t)^{T}} \\ &= \norm{\pi^{T} - \left(\pi^{T} + p(0)^{T} \sum_{i=2}^n \lambda_{i}^{t} u_{i} v_{i}^{T}\right)} \\ &= \norm{p(0)^{T} \sum_{i=2}^n \lambda_{i}^{t} u_{i} v_{i}^{T}} \\ &= \norm{p(0)^{T} \lambda_{2}^{t} u_{2} v_{2}^{T} + p(0)^{T} \sum_{i=3}^n \lambda_{i}^{t} u_{i} v_{i}^{T}} \\ &\approx |\lambda_{2}|^{t} \norm{p(0)^{T} u_{2} v_{2}^{T}} \end{align*} This shows that the convergence rate is determined by the second largest eigenvalue: the error $E(t)$ decays exponentially as $|\lambda_{2}|^{t}$. The rate of decay is $- \ln(|\lambda_{2}|)$ and the corresponding timescale is $- 1 / \ln(|\lambda_{2}|)$. In other words, the number of steps needed to come close to the stationary distribution is of the order of $- 1 / \ln(|\lambda_{2}|)$. Note also that if you start with an initial distribution $p(0) \approx \pi$, then convergence is (nearly) immediate.
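For a concrete feel of this timescale, here is a minimal NumPy sketch on the same kind of made-up two-state chain (all numbers are assumptions); it compares the predicted decay $|\lambda_{2}|^{t}$ with the actual error:

```python
import numpy as np

# Illustrative chain; entries are made up for this sketch.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

mods = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
lambda2 = mods[1]                         # |lambda_2| = 0.5 for this chain
timescale = -1.0 / np.log(lambda2)        # ~1.44 steps
print(lambda2, timescale)

# Compare the predicted decay |lambda_2|^t with the actual error ||pi^T - p(t)^T||.
pi = np.array([0.8, 0.2])                 # stationary distribution of this P
p = np.array([1.0, 0.0])                  # an arbitrary initial distribution
for t in range(1, 6):
    p = p @ P
    print(t, np.linalg.norm(pi - p), lambda2 ** t)
```

For this toy chain the error is halved at every step, exactly as the second eigenvalue predicts.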

Finally, if $P$ is not diagonalizable, one needs to start with the Jordan normal form of $P$ to arrive at a similar result; see [3].

References:

[1] Convergence Rate of Markov Chains by Jeffrey S. Rosenthal. http://probability.ca/jeff/ftpdir/eigen.pdf

[2] Lecture Notes on Stochastic Processes by Frank Noé, Bettina Keller and Jan-Hendrik Prinz. https://www.mi.fu-berlin.de/wiki/pub/CompMolBio/MarkovKetten15/stochastic_processes_2011.pdf

[3] On the Importance of the Second Largest Eigenvalue on the Convergence Rate of Genetic Algorithms by Florian Schmitt and Franz Rothlauf. https://dl.acm.org/doi/pdf/10.5555/2955239.2955324
