Solved – Validate a Markov Chain with few states (model switching)

markov chaintime seriesvalidation

Suppose I have a five state Markov chain. The states are observable (in fact I defined them, they are outcomes of an classification algorithm). So I have a long time series, see the first picture.

The states correspond to a real stochastic situation, it represents the environment for a car driver, e.g., the highway, motorway, city etc. The data comes from several independent drivers and I think this should be “purely” stochastic, so I can’t use any regression method, correct? So I chose the (time discrete) Markov chains. My goal is to create a model-switching Markov chain.
In each state there is another “model” to predict the speed of the car. The prediction is done via sampling from the speed distribution which has different parameters for each Markov-chain state.

I’ve read about hidden Markov chains with autoregressive models, but here I don’t have hidden states.

the real time series

Now I try to estimate the transfer-probability matrices via a maximum-likelihood estimation. The TPM for example is

0.85393     0.033708     0.078652     0.033708            0
0.11111      0.66667      0.11111     0.055556     0.055556
    0.2          0.1      0.66667     0.033333            0
0.11905            0      0.02381      0.83333      0.02381
      0            0            0          0.5          0.5

I calculate the realization, according to Handbook of MCM, chapter 3 and two possible timeseries are:

realization 1
realization 2

Fine, does not really look like the original time series. Sure, it’s stochastic. So, how do I prove, that these series represent the real time series? I think that looking at the time series is not sufficient. If I had many states, I would go for a histogram. I tried the autocorrelation function to see, if there are dependencies, but I'm not sure, if there are ways to validate my Markov Chain with the auto- or crosscorrelation function.

Or is the only possibility to use the states and predict the vehicle speed and validate it via MSE or similar? But again there is the problem, that it is not a “real” time series, since it is stochastic.

Best Answer

What you have to verify is your model, not your time series (though sample time series from your model can be used for this).


For example, you are assuming that your data is described by a first-order Markov chain, i.e., the probability of observing a given state only depends on the past state or, with other words, the process has a memory of 1. You can substantiate this assumption by showing that the process does not have a longer memory.

As an example, consider a Markov chain with only two states, called A and B. Let’s say that you have counted the subchains of length 3 with A in the middle and got:

  • A→A→A: 723 times
  • A→A→B: 387 times
  • B→A→A: 233 times
  • B→A→B: 130 times

If the Markov chain is of order 1, the proportions of subchains ending with A should not depend on whether the chain started with A or B. This is a general statistical task, which is often represented in form of a contingency table. The corresponding contigency table for this case would be:

               prior state
                 A    B   total
––––––––––––––––––––––––––––––––
next state A |  723  233   956
next state B |  387  130   517
total        | 1110  363  1473

This can, e.g., be tested with Fisher’s exact test, which for this data should yield that the next state and the prior state are independent and thus the state A in our example Markov chain does not preserve any memory of the preceding state (state B hower might do this, so we have to test it as well).

As you have six states, you would have to perform six such tests with 6×6 contingency tables. Be aware that this is multiple testing. Also, in theory your data can pass this test but have an even longer memory, but given your data, I consider this unlikely.


Another way would be comparing some statistical properties of your original data with your simulated ones (which should be the same), for example the distribution of state durations or waiting times between two instances of a state. Should these differ, there is something wrong about your model. (However, your model can still be wrong if the distributions comply.)

Of course you could go one step further and use your model to get a simulated time series of vehicle speeds and compare appropriate properties such as the autocorrelation or the frequency content (FFT) with your original data. Being stochastic makes the results somewhat different from those for regular time series, but they are still meaningful and should be comparable if your model is good.

Either way, the more independent properties you can reproduce, the better – but you will there may always be aspects your model fails to reproduce despite complying to the original data in many aspects. Be also careful that you do not use aspects for verification that your model fulfills by construction. For example the histogram for the individual Markov-chain states of your model will always comply with the one for the original data, as it’s determined by the transition rates.


Finally keep in mind that no model is perfect and that it may still be good for what you are doing despite failing to comply with the original data. For example if there is some low frequency content in your original data but your model fails to reproduce this, this may be alright, if what is important for your application happens on much smaller time scales.

Related Question