Solved – Is an auto-correlation plot suitable for determining at what point time series data has become random, and how does one interpret the plot

autocorrelation, data visualization, lags, time series

A piece of research I am working on requires us to decide at what point time series data has become random. For what it is worth, the time sequence in question is a collection of in-process timings for repetitions of a computer program benchmark.

EDIT2: It may be easier to think about the desired state as "the point at which readings have mostly stabilised with only small random variations".

EDIT (addressing @juho-kokkala's comment): Under utopian circumstances, the results from repeatedly measuring the time a computer benchmark takes to execute (within a single process) should be pretty much random. We would expect small random variations to be introduced by, for example, the operating system's scheduler. With JITs (just-in-time compilers), however, execution starts in a slow interpreter, and as the compiler detects "hot code", parts of the program are compiled to native code. Since the interpreter is slow, and since compilation also costs time, we expect data points at the beginning of the time series to take longer to execute. Later, once compilation has stopped and most of the program executes as native code, we would hope to see the time series revert to a state of small random variations (this is a simplified view of our domain; other mechanisms get in the way, e.g. garbage collection). The point at which this state has been reached is relevant: it is common to discard the data points before it, thus measuring the "peak performance" of the system running the benchmark.

It has been suggested in this paper that a combination of lag and autocorrelation plots could help. The authors suggest that once the correlated prefix has been disregarded:

  • The lag plots should not have clusters.
  • The auto-correlation plots should indicate low correlation.
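
As a concrete starting point, here is a minimal sketch of producing both diagnostic plots; it assumes the timings live in a hypothetical list called `timings`, and that a reasonably recent pandas is used (in which the plotting helpers live under `pandas.plotting`):

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import lag_plot, autocorrelation_plot

    series = pd.Series(timings)  # `timings` is the benchmark time series (assumed)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    lag_plot(series, lag=1, ax=ax1)       # clusters here hint at structure
    autocorrelation_plot(series, ax=ax2)  # spikes outside the bands hint at non-randomness
    plt.show()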

My question relates to the interpretation of the auto-correlation plot. Let me first give an example of a situation that does not appear to stabilise.

Here is a run sequence plot for a time series:

Run Sequence Plot

And here is the corresponding autocorrelation plot as generated by Pandas:

Autocorrelation Plot

According to the documentation for the auto-correlation function in pandas:

If time series is non-random then one or more of the auto-correlations will be significantly non-zero

On the auto-correlation plot, the horizontal lines indicate confidence bands:

The horizontal lines displayed on the plot correspond to 95% and 99% confidence bands. The dashed line is 99% confidence band.

I think pandas normalises the autocorrelation values (dividing by the lag-0 autocovariance), so they lie between $-1$ and $1$.
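
If I have read the pandas source correctly, these bands are simply the large-sample white-noise bounds $\pm z/\sqrt{n}$. A rough sketch of computing them by hand, assuming our $n = 200$ observations:

    import numpy as np
    from scipy.stats import norm

    n = 200  # number of observations in the series (200 in our data set)
    band_95 = norm.ppf(0.975) / np.sqrt(n)  # ~= 1.96 / sqrt(n), solid line
    band_99 = norm.ppf(0.995) / np.sqrt(n)  # ~= 2.58 / sqrt(n), dashed line
    print(band_95, band_99)  # roughly 0.139 and 0.182 for n = 200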

We can see that the data is not random just by looking at the run sequence graph. There is a clear pattern.

The auto-correlation plot appears to suggest that there is a high correlation for small lag values. Moreover, as you increase the lag value, the data appears more and more random, until at lag 170 (ish) the auto-correlation values fall inside confidence bands.

I'm fairly sure that the wrong way to read this is:

After 170 iterations, the data is suitably random.

Would anyone be able to explain intuitively the relevance of this gradual inward decay of the correlation values? What does it mean for the correlation values to move inside the confidence bands at a lag of around 170?

A sub-question I would like to pose: is there a better technique for what we are trying to achieve here?

Thanks!

EDIT3: Thanks @kyler-brown! Here is the unbiased autocorrelation plot for the same data with maxlags set to the size of the data set (200):

Unbiased Autocorrelation Plot (maxlags = 200)

Indeed the graph no longer tapers off at higher lags. Notice that there is a dip on the far right. I think this is because, as the lag value approaches the size of the data set, there are fewer and fewer pairs of data points to average over, and so the estimate breaks down. If we use a maxlags value of 100, there is no such artefact:

Unbiased Autocorrelation Plot (maxlags = 100)

As for what the plot shows: I think I am right to say that the peaks at lags $\{4, 8, 12, 16, …\}$ indicate that samples situated $\{4, 8, 12, 16, …\}$ apart are highly correlated. This seems to check out, given that the cycles we see in the run sequence plot are roughly 4 samples wide.
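
One quick sanity check, assuming `data` holds the timing series as a 1-D array, is to compute the lag-4 correlation directly by pairing each sample with the one four positions later:

    import numpy as np

    x = np.asarray(data, dtype=float)  # `data` is the timing series (assumed)
    lag = 4
    r = np.corrcoef(x[:-lag], x[lag:])[0, 1]  # Pearson correlation at lag 4
    print("lag-4 correlation:", r)            # should be close to the ACF peak at lag 4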

I would be interested to see if I have correctly understood.

EDIT4:

Having searched the internet a bit, I think I have found a technique which can complement an autocorrelation plot: seasonal decomposition. I think this can be used to address the fact that autocorrelation plots don't always make it easy to spot trends.

To illustrate, I inserted a subtle trend into our data as follows:

    # Add a small linear drift to each sample to simulate a gradual upward trend.
    for i in range(len(data)):
        data[i] = data[i] + i * 0.0001

The unbiased autocorrelation plot for this data looks like this:

Unbiased Autocorrelation Plot (with the subtle trend added)

It's hard to see a trend. The following graph shows the seasonal decomposition of our time series data using a frequency of 4 (which we determined above):

Seasonal Decomposition Plot

The plot shows:

  • The original "observed" data.
  • The overall trend, separate from the cycles (a.k.a. seasons) and residual noise.
  • The seasons separate from the trend and residual noise.
  • Residual noise separate from the trend and the seasons.

Notice that you can see the upward trend clearly here.
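
For completeness, here is roughly how such a decomposition can be produced with statsmodels (a sketch only, assuming `data` is the timing series; older statsmodels versions call the cycle-length argument `freq`, newer ones call it `period`):

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    series = pd.Series(data)  # `data` is the timing series (assumed)
    result = seasonal_decompose(series, model='additive', period=4)
    result.plot()   # panels: observed, trend, seasonal, residual
    plt.show()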

Another good example is given in the statsmodels docs. There, the seasonal component is extracted from data with a fairly exaggerated upward trend.

Best Answer

Autocorrelation values are computed by taking each pair of values separated by a given lag, multiplying the pair together, and summing across all pairs. Because your signal has a finite length, large lags have fewer and fewer pairs contributing to the sum, and thus produce smaller values. You can compensate for this by using an "unbiased" autocorrelation. Statsmodels' acf has an option to return an unbiased estimate: http://statsmodels.sourceforge.net/stable/generated/statsmodels.tsa.stattools.acf.html
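
For reference, a minimal sketch of what this might look like, assuming `data` is the timing series (note that recent statsmodels releases have renamed the `unbiased` keyword to `adjusted`):

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.tsa.stattools import acf

    # `data` is the timing series; nlags=100 mirrors the maxlags=100 plot above.
    values = acf(data, nlags=100, adjusted=True, fft=False)

    plt.stem(np.arange(len(values)), values)
    plt.xlabel("lag")
    plt.ylabel("autocorrelation (unbiased estimate)")
    plt.show()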