[Math] Understanding the random variable definition of Markov chains

markov-chains, probability, stochastic-processes

Update: This question is answered in section 3.2 of these notes.


As a probability novice, I'm struggling to completely understand the definition of a Markov chain as a sequence of random variables.

For simplicity, consider discrete-time, homogeneous Markov chains with finite state spaces $S$ which we take to be $\{1, \dots, n\}$. I understand the following definition of a Markov chain in this context:

A Markov chain is a pair $(S,P)$ where $P = (P_{ij})$ is a transition matrix.

Given this definition, one can generate a trajectory of the Markov chain consisting of an infinite sequence $s_0, s_1, \dots$ of elements of $S$ by initializing in a state $s_0$ and evolving forward according to the transition probabilities $(P_{ij})$.
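To make this concrete, here is a minimal Python sketch of that sampling procedure. The two-state matrix and 0-indexed states are invented for illustration, and I assume the column convention $P_{ij} = \mathbf P(\text{next} = i \mid \text{current} = j)$ stated further down:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(P, s0, num_steps):
    """Sample a trajectory s_0, s_1, ..., s_{num_steps}.

    Column convention: P[i, j] = P(X_k = i | X_{k-1} = j), so column j
    of P is the distribution of the next state given the current state
    j. States are 0-indexed here: S = {0, ..., n-1}.
    """
    n = P.shape[0]
    trajectory = [s0]
    for _ in range(num_steps):
        j = trajectory[-1]
        trajectory.append(rng.choice(n, p=P[:, j]))  # evolve one step
    return trajectory

# Invented two-state example; each column sums to 1.
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])
print(simulate_trajectory(P, s0=0, num_steps=10))
```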

So far so good — this is all quite intuitive.

However, suppose instead that one considers a Markov chain as a sequence $X_0, X_1, \dots$ of random variables with values in $S$ having the Markov property. Is there a standard mapping between these definitions? In particular, since each $X_k$ is a random variable whose domain is some sample space $\Omega$,

$$
X_k:\Omega\to S,
$$

and since the transition probabilities are usually described as conditional probabilities,

$$
P_{ij} = \mathbf P(X_k = i\mid X_{k-1}=j),
$$

or the transpose of this depending on your conventions, presumably there is a sample space $\Omega$ and a probability measure $\mathbf P$ sitting somewhere?

One guess would be that $\Omega$ can be taken to be the set of all sequences $s_0, s_1, \dots$ of elements of $S$, that the random variable $X_k$ maps any such sequence to its $k^\mathrm{th}$ element,

$$
X_k(s_0, s_1, \dots) = s_k,
$$

and that $\mathbf P$ is any probability measure on (a suitable $\sigma$-algebra of) subsets of $\Omega$ satisfying the Markov property.
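As a partial numerical check of this guess, any such $\mathbf P$ should assign to the cylinder set of trajectories beginning with a fixed prefix $s_0, \dots, s_m$ the probability $\mu_0(s_0)\, P_{s_1 s_0} \cdots P_{s_m s_{m-1}}$, where $\mu_0$ is the initial distribution, and these numbers should sum to $1$ over all prefixes of a fixed length. A sketch, with an invented $\mu_0$ and the same illustrative matrix as above:

```python
import numpy as np
from itertools import product

def prefix_probability(P, mu0, prefix):
    """P-measure of the cylinder set of trajectories starting with
    `prefix` = (s_0, ..., s_m): mu0[s_0] * prod_k P[s_k, s_{k-1}],
    again with P[i, j] = P(X_k = i | X_{k-1} = j)."""
    prob = mu0[prefix[0]]
    for prev, curr in zip(prefix, prefix[1:]):
        prob *= P[curr, prev]
    return prob

P = np.array([[0.9, 0.5],
              [0.1, 0.5]])
mu0 = np.array([1.0, 0.0])  # invented: start in state 0 with certainty

# Cylinder sets over all prefixes of a fixed length partition Omega,
# so their probabilities must sum to 1.
total = sum(prefix_probability(P, mu0, p) for p in product([0, 1], repeat=4))
print(total)  # 1.0 up to floating point
```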

Is this description of (one direction of) the correspondence between these Markov chain definitions correct and/or standard?

Best Answer

I'm not sure this completely answers your questions and I'm definitely no expert, but it's the best I could do. Sorry it's a bit long and meandering, but I do get to your questions eventually.

Yes, there is always a sample space sitting behind a random variable $X$, but as you've noticed it's usually suppressed, since it in some sense describes the "world" and all possible states the world might be in, and is thus way too complicated to deal with directly.

Our goal in defining a random variable $X:\Omega\rightarrow\mathbb{R}$ is in a very real sense to simplify our model of the "world" by mapping many states to the same outcome. For instance if $X$ is $0$ or $1$ depending on if my coin flip is heads or tails, this outcome will be, presumably, independent of whether it rains tomorrow, or the political dynamics in China, etc.

To be more explicit: since the coin flip is a 50/50 chance and independent of whether or not it rains tomorrow, the set of outcomes where it rains overlaps evenly (in terms of the measure of the two overlaps) with the two disjoint coin-flip sets which partition our sample space. Moreover, by independence, the overlaps are such that conditioning on heads will not change the probability of rain; that is to say $$P(\text{rain}) = \frac{P(\text{rain} \cap \text{heads})}{P(\text{heads})}$$
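To see this overlap picture in miniature, here is a toy finite $\Omega$ in Python; all the weights are invented for illustration:

```python
# A toy sample space: each outcome records (coin, weather) with its
# probability; independence is built in by multiplying the marginals.
omega = {
    ("heads", "rain"):    0.5 * 0.3,
    ("heads", "no rain"): 0.5 * 0.7,
    ("tails", "rain"):    0.5 * 0.3,
    ("tails", "no rain"): 0.5 * 0.7,
}

def prob(event):
    """Measure of the set of outcomes satisfying the predicate `event`."""
    return sum(p for w, p in omega.items() if event(w))

rain = lambda w: w[1] == "rain"
heads = lambda w: w[0] == "heads"

# Conditioning on heads does not change the probability of rain,
# exactly as in the display above.
print(prob(rain))                                          # 0.3
print(prob(lambda w: rain(w) and heads(w)) / prob(heads))  # 0.3
```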

Thus the sample space is overlaid with this unimaginably complex collection of sets (measurable events), with all sorts of intricate patterns of overlapping symmetries.

Fortunately, in simplifying our view of the "world" via a random variable $X:\Omega\rightarrow\mathbb{R}$, $\mathbb{R}$ itself inherits the structure of a probability space, where we measure the probability of an event $E\subseteq\mathbb{R}$ by pulling back to $\Omega$ as follows: $$P(E) := \mathbf{P}(\omega\in\Omega : X(\omega)\in E)$$

Thus not only can we suppress the original sample space $\Omega$, but we can even suppress the original random variable if we want to; in other words, we could define a new random variable $Y:\mathbb{R}\rightarrow\mathbb{R}$ which is just the identity function. Furthermore, under some mild assumptions we can write this inherited measure as the Lebesgue integral of a non-negative function $f$ with total integral $1$, which is just our familiar pdf. This is why $\mathbb{E}Y$ is just $\int_{\mathbb{R}} y f \, d\mu$.
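Here is the discrete analogue of that inheritance on the same toy $\Omega$: the pushforward pmf of an invented $X$, and the expectation computed from it alone, with a finite sum standing in for the Lebesgue integral:

```python
# Same toy sample space as before; X maps each outcome to 0 or 1.
omega = {
    ("heads", "rain"): 0.15, ("heads", "no rain"): 0.35,
    ("tails", "rain"): 0.15, ("tails", "no rain"): 0.35,
}
X = lambda w: 1 if w[0] == "heads" else 0

# P(E) := P({omega : X(omega) in E}), tabulated here as a pmf on {0, 1}.
pmf = {}
for w, p in omega.items():
    pmf[X(w)] = pmf.get(X(w), 0.0) + p
print(pmf)  # {1: 0.5, 0: 0.5}

# The expectation needs only the inherited pmf, never Omega itself.
print(sum(y * p for y, p in pmf.items()))  # 0.5
```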

The important thing is that this is where most modeling begins: not with $X$, but with $Y$. This is why in applications we so often begin our model by giving a distribution/density function which tells us $P(E)$ directly, and then we just wave our hands about how this "in theory" pulls back to some a priori consistent but ultimately unknowable sample space $\Omega$.

Thus for your finite-state-space Markov chain, it's important to clarify that in this picture $s_0, s_1, \dots$ are not individual states of your Markov chain: they are the evolving inherited probability mass functions on the state space $S$, and your transition matrix tells you how to go from one pmf to the next. Of course, behind each $s_i$ is a random variable $X_i$ which takes value $m$ with probability $s_{im}$ and has an underlying sample space $\Omega$ such that $$P(X_i = m) := \mathbf{P}(\omega\in\Omega : X_i(\omega) = m) = s_{im}.$$

Hence I want to make it clear that in the transition-matrix formulation of a Markov chain, both the sample space $\Omega$ and the sequence of random variables $\{X_i\}$ are suppressed, and you're instead just watching the evolution of the sequence of associated probability mass functions $\{s_i\}$, which tell you the probability that your Markov chain is at a particular state/node at time $i$.

Your second formulation $P_{ij}=P(X_k=i \mid X_{k-1}=j)$ is thus equivalent to setting $s_{k-1}$ equal to the standard basis vector $e_j$ and then reading off the $i^{\mathrm{th}}$ coordinate of the vector $P e_j$; this of course will be equal to $P_{ij}$.
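A quick numpy check of this claim, with an invented two-state matrix in the question's column convention:

```python
import numpy as np

# Column convention: P[i, j] = P(X_k = i | X_{k-1} = j).
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])

# Conditioning on X_{k-1} = j means taking s_{k-1} = e_j; one step of
# evolution is s_k = P @ s_{k-1}, and coordinate i of P @ e_j is P[i, j].
for j in range(2):
    e_j = np.eye(2)[:, j]
    assert np.allclose(P @ e_j, P[:, j])
print(P @ np.array([1.0, 0.0]))  # column 0 of P: [0.9, 0.1]
```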

When you wrote $X_k(s_0, s_1, \dots)$, I think what you might be thinking about is the probability of various sequences of states of your Markov chain, which is something different.

Finally, as for what $\Omega$ is, I wouldn't worry about it. If you really want to, you can play around with some finite set $\Omega$ and try overlapping some sets in various ways, to see how the concept of independence is really about the way sets overlap in a sort of symmetric, cascading fashion.
