Solved – Markov chains: how to validate a state-transition probability matrix on new data

confidence-interval, markov-process, r

I have state-transition data with multiple sequences of unequal lengths, each sequence corresponding to one subject/person:

head(prob_tbl2)

   period1 period2 period3 period4 period5 period6 period7 period8
1    <NA>  active  active  active  active  active  active  active
2  active  active  active  active  active  active  active  active
3  active  active  active  active  active  active  active  active
4    <NA>    <NA>  active lapsed1  active lapsed1  active lapsed1
5    <NA>    <NA>  active lapsed1 lapsed2 lapsed3 lapsed4 lapsed5
6    <NA>  active lapsed1 lapsed2 lapsed3 lapsed4 lapsed5 lapsed6

Based on this data, I created a probability matrix (as suggested here):

# Estimate a first-order transition matrix from a matrix of state
# sequences (one row per subject, one column per period).
trans.matrix <- function(X, prob = TRUE)
{
  # Pair each state with the state in the following column, then tabulate.
  tt <- table(c(X[, -ncol(X)]), c(X[, -1]))
  if (prob) tt <- tt / rowSums(tt)  # normalize rows to probabilities
  tt
}

prob_table <- trans.matrix(as.matrix(prob_tbl2))
prob_table

           active    lapsed1    lapsed2    lapsed3    lapsed4    lapsed5    lapsed6    lapsed7
  active  0.56006735 0.43993265 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
  lapsed1 0.19183552 0.00000000 0.80816448 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
  lapsed2 0.10206773 0.00000000 0.00000000 0.89793227 0.00000000 0.00000000 0.00000000 0.00000000
  lapsed3 0.06437215 0.00000000 0.00000000 0.00000000 0.93562785 0.00000000 0.00000000 0.00000000
  lapsed4 0.04468764 0.00000000 0.00000000 0.00000000 0.00000000 0.95531236 0.00000000 0.00000000
  lapsed5 0.04115773 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.95884227 0.00000000
  lapsed6 0.02306733 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.97693267

How could I validate those probabilities, e.g. obtain standard errors/confidence intervals, and cross-validate them against a new/unseen dataset? Could I bootstrap my data or apply MLE to get the uncertainty levels, and if so, how would I do it in R?

I tried to use library(markovchain), without much luck on my dataset; I described the problem here

Clearly, I'm new to Markov chains so any hints will be helpful!

Best Answer

A general approach is to look at the likelihood of the data given the model parameters, i.e. $P(\text{sequences} \mid \text{matrix elements})$, or, for practical reasons, the log-likelihood. For a (non-hidden) Markov model it is simply $\log L = \sum \log m_{ij}$, where we sum over all transitions that occurred.

As you can see, it does not matter how many sequences there are, or how long they are (they may all be different). Be warned that a single occurrence of a transition whose matrix element is 0 makes the likelihood zero.
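To make this concrete, here is a minimal sketch of the log-likelihood computation in R, using the same one-row-per-subject layout as `prob_tbl2`; the names `log_lik`, `X`, and `P` are illustrative, not from the original question:

```r
# Sketch: log-likelihood of state sequences under a fitted transition
# matrix P (rows = current state, columns = next state).
log_lik <- function(X, P) {
  from <- c(X[, -ncol(X)])                 # state at time t
  to   <- c(X[, -1])                       # state at time t + 1
  keep <- !is.na(from) & !is.na(to)        # drop transitions touching <NA>
  probs <- P[cbind(from[keep], to[keep])]  # matrix element per transition
  sum(log(probs))  # -Inf if any observed transition has probability 0
}

# Toy example with two states:
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), nrow = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), c("a", "b")))
X <- rbind(c("a", "a", "b"),
           c("b", "b", "a"))
log_lik(X, P)  # log(0.7) + log(0.6) + log(0.3) + log(0.4)
```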

In general, you compare it against two baselines: uniformly distributed symbols, and symbols distributed according to their observed marginal distribution. See also: perplexity (for measuring the performance of sequence prediction).

You can easily compute the log-likelihood per transition for both the training and the validation data.
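A possible sketch of that comparison, with illustrative names throughout (`ll_per_transition`, `train`, `valid` are not from the original question):

```r
# Sketch: per-transition log-likelihood vs. two baselines.
ll_per_transition <- function(X, P) {
  from <- c(X[, -ncol(X)]); to <- c(X[, -1])
  keep <- !is.na(from) & !is.na(to)
  mean(log(P[cbind(from[keep], to[keep])]))
}

states <- c("a", "b")
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), nrow = 2, byrow = TRUE,
            dimnames = list(states, states))
train <- rbind(c("a", "a", "b"), c("b", "b", "a"))
valid <- rbind(c("a", "b", "b"))

c(model   = ll_per_transition(valid, P),
  uniform = log(1 / length(states)),  # uniform-next-state baseline
  marginal = {                        # observed-frequency baseline
    p  <- table(c(train)) / length(c(train))
    to <- c(valid[, -1])
    mean(log(p[to]))
  })
```

Higher (less negative) per-transition log-likelihood on the validation data than both baselines is the signal that the fitted matrix actually predicts something.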

For a general reference, I recommend:

Measuring the uncertainty of this model can be more complicated and depends on the assumptions (as always). Below I mention two general approaches, which can be tailored to a Markov model.

Approach 1: Gaussian approximation of likelihood

You need to be careful with the stochastic-matrix constraints (each row summing to 1) when differentiating the log-likelihood.
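For a first-order chain, each row of the matrix is a multinomial MLE, so the Gaussian approximation reduces to the familiar $\sqrt{\hat p (1-\hat p)/n}$ standard error per element. A minimal sketch, assuming a hypothetical transition count `n_row` for the "active" row (not taken from the question's data):

```r
# Sketch: Gaussian (Laplace) approximation standard errors for one row.
# For a row estimated from n_row observed transitions, the MLE p_hat has
# approximate standard error sqrt(p_hat * (1 - p_hat) / n_row).
p_hat <- c(active = 0.56, lapsed1 = 0.44)  # estimated row probabilities
n_row <- 2375                              # hypothetical transition count
se    <- sqrt(p_hat * (1 - p_hat) / n_row)
rbind(lower = p_hat - 1.96 * se,           # approximate 95% CI
      upper = p_hat + 1.96 * se)
```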

See: MacKay's Information Theory, Inference, and Learning Algorithms, Chapter IV (Probabilities and Inference), Laplace's Method.

Approach 2: Bayesian model (with matrix elements being random variables)

It depends on the prior (I would go with a flat prior over stochastic matrices) and can take a lot of computational power (Markov chain Monte Carlo, not to be confused with the Markov chain you are using), but it gives interpretable results (and no risk of zero estimates).
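For this particular model there is a shortcut worth knowing: with a flat Dirichlet prior, each row's posterior is conjugate, namely Dirichlet(counts + 1), so you can sample credible intervals directly without MCMC. A sketch, with a hypothetical count vector for one row:

```r
# Sketch: Bayesian credible intervals for one row of the transition matrix.
# With a flat Dirichlet(1, ..., 1) prior and multinomial counts, the row
# posterior is Dirichlet(counts + 1); sample it via normalized Gamma draws.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), n, byrow = TRUE)
  g / rowSums(g)  # each row is one draw from the Dirichlet
}

counts <- c(active = 1330, lapsed1 = 1045)  # hypothetical transition counts
draws  <- rdirichlet(10000, counts + 1)
apply(draws, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals
```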

Again, MacKay is wonderful at explaining this approach. For the technical part I recommend PyMC3 in Python, especially with this tutorial:

I guess there are some tools in R, but I've never used them.
