Time Series – Understanding Autocorrelation of Discrete Time Series

Tags: autocorrelation, categorical data, correlation, pearson-r, time series

I am planning to calculate the autocorrelation of a time series at various lags. However, the elements of my series are discrete, abstract classes, i.e., not numbers.

For example, my series could look like:

class A, class B, class A, class A …

Of course, I could say that class A = 1 and class B = 2 and then use the usual Pearson correlation to determine the autocorrelation. However, this seems problematic to me, because the numeric values assigned are arbitrary and do not express anything.

Is there some way around this?

Best Answer

With nominal categories, the concept of autocorrelation can be a little tricky, but there are ways to measure the tendency of values from the different categories to occur together.

If we consider observations from classes A and B, it's relatively easy to calculate the tendency of values in a class to 'clump together' (a kind of positive association) or avoid each other (a kind of negative association) in comparison to a purely independent, identically distributed sequence.

One approach might be to consider a statistic based on the total number of runs - that is, adapt the statistic from a Wald-Wolfowitz* runs test, perhaps by standardizing the number of runs (yielding a Z-score). However, you may want to "flip the sign" so that positive Z implies positive association.
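As a minimal sketch of that idea for two classes, the snippet below standardizes the number of runs using the usual null mean $2n_1n_2/n + 1$ and variance $2n_1n_2(2n_1n_2 - n)/(n^2(n-1))$, then flips the sign so that clumping gives a positive value. The function names `count_runs` and `runs_association_z` are made up for illustration.

```python
import math

def count_runs(seq):
    """Number of maximal blocks of identical consecutive labels."""
    if not seq:
        return 0
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def runs_association_z(seq):
    """Sign-flipped runs Z-score for a two-class sequence.

    Positive values suggest clumping (fewer runs than expected),
    negative values suggest alternation (more runs than expected).
    """
    labels = sorted(set(seq))
    if len(labels) != 2:
        raise ValueError("expected exactly two distinct labels")
    n1 = sum(1 for s in seq if s == labels[0])
    n2 = len(seq) - n1
    n = n1 + n2
    # Mean and variance of the number of runs under random ordering,
    # conditional on the class counts n1 and n2.
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if var <= 0:
        raise ValueError("sequence too short for a meaningful Z-score")
    z = (count_runs(seq) - mean) / math.sqrt(var)
    return -z  # flipped so that positive means positive association

clumped = list("AAAAABBBBB")
alternating = list("ABABABABAB")
print(runs_association_z(clumped), runs_association_z(alternating))
```

For short series the normal approximation behind the Z-score is rough, so the value is better read as a descriptive index than as an exact test statistic.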

Alternatively, we might try to convert the number of runs into some correlation measure. The difficulty is that if the least possible number of runs maps to $\rho=1$, the largest possible number of runs maps to $\rho=-1$, and complete independence maps to $\rho=0$, then the correlation must in general be some nonlinear function of the number of runs, because the expected number of runs under independence is generally not midway between the two extremes.


With more than two categories, the idea of using the number of runs can be extended, either

i) by constructing something more akin to a chi-square statistic (a sum of terms like $\left(\frac{R_i-E(R_i)}{S_i}\right)^2$, where $S_i$ is adjusted for the fact that the numerators are dependent); or

ii) by simply basing a statistic off the total number of runs. The distribution of the total number of runs is discussed in

Barton, D.E. and F.N. David (1957),
"Multiple Runs,"
Biometrika, Vol. 44, No. 1/2 (Jun.), pp. 168-178

and so, for example, an asymptotic normal approximation is given (top of p. 171), with formulas for the mean and variance of the number of runs also in the paper.

The approach in (i) has the disadvantage that low values indicate lack of serial dependence, and high values indicate either clumping of like terms or self-avoidance (anti-clumping); if that's a problem, the approach in (ii) may be preferable.
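Approach (ii) can be sketched as follows. Note that this sketch does not use the asymptotic formulas from Barton and David; as a stand-in, it estimates the null mean and standard deviation of the total number of runs by randomly shuffling the series. The function name `permutation_runs_z` and its parameters `n_perm` and `seed` are hypothetical.

```python
import random
import statistics

def count_runs(seq):
    """Number of maximal blocks of identical consecutive labels."""
    if not seq:
        return 0
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def permutation_runs_z(seq, n_perm=2000, seed=0):
    """Sign-flipped Z-score of the total number of runs, any number of classes.

    The null distribution is approximated by shuffling the labels,
    which keeps the class counts fixed but destroys serial order.
    """
    rng = random.Random(seed)
    observed = count_runs(seq)
    shuffled = list(seq)
    null_runs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_runs.append(count_runs(shuffled))
    mean = statistics.mean(null_runs)
    sd = statistics.pstdev(null_runs)
    if sd == 0:
        raise ValueError("no variation in runs under shuffling")
    # Fewer runs than expected (clumping) gives a positive value.
    return (mean - observed) / sd

print(permutation_runs_z(list("AAAABBBBCCCC")))
print(permutation_runs_z(list("ABCABCABCABC")))
```

The shuffled counts can also be used directly to get a Monte Carlo p-value, which avoids relying on normality for short series.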


If we consider only two categories, of course, we can treat it as binary and compute a Pearson (auto)correlation.

This is effectively a Phi coefficient.
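A minimal sketch of the two-category case, assuming one class is coded as 1 and the other as 0: the function below (the name `binary_autocorrelation` is made up for illustration) computes the Pearson correlation between the series and its lagged copy, which is the phi coefficient of the 2x2 table of $(x_t, x_{t+k})$ pairs. With only two classes the choice of which one is coded 1 does not change the result. This differs slightly from the conventional sample autocorrelation, which uses the overall mean and variance of the full series.

```python
import math

def binary_autocorrelation(seq, lag=1):
    """Pearson correlation between a two-class series and its lagged copy."""
    labels = sorted(set(seq))
    if len(labels) != 2:
        raise ValueError("expected exactly two distinct labels")
    if not 1 <= lag < len(seq):
        raise ValueError("lag must be between 1 and len(seq) - 1")
    # Arbitrary 0/1 coding; swapping the codes leaves the result unchanged.
    x = [1.0 if s == labels[0] else 0.0 for s in seq]
    a, b = x[:-lag], x[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("one of the lagged segments is constant")
    return cov / math.sqrt(var_a * var_b)

series = list("ABABABABAB")
print(binary_autocorrelation(series, lag=1))
print(binary_autocorrelation(series, lag=2))
```

For a strictly alternating series the lag-1 value is -1 and the lag-2 value is +1, as one would expect.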


* The Wald-Wolfowitz runs test was not invented by Wald and Wolfowitz, but that's how it often goes.