ACF Output Interpretation – Time Series Analysis

autocorrelation, time series

 acf(c(0,1,2,3,4,5),plot=FALSE)
Autocorrelations of series ‘c(0, 1, 2, 3, 4, 5)’, by lag

     0      1      2      3      4      5 
 1.000  0.500  0.057 -0.271 -0.429 -0.357

Why does the ACF output become negative as the lag increases? My understanding is that no matter what the lag is, the series is in general increasing. Therefore the autocorrelation should be positive. For example, at lag 2, we are calculating the correlation of the two series [0,1,2,3] and [2,3,4,5], where the positive correlation still holds. Where am I going wrong?

Update

Here is my intuitive understanding of the ACF of a monotonically increasing sequence:

The ACF of a sequence is a function $\gamma(k)$ of the lag $k$. By definition, this function measures the correlation between $y_t$ and $y_{t-k}$. My misunderstanding came from how I was thinking about correlation. A monotonically increasing sequence is not stationary, so its mean is not stable. In other words, the sequence does not exhibit mean-reverting behavior. This distorts my usual understanding of correlation (where we think of the mean level as 0). Since the mean increases over time, observations that come earlier are more likely to lie below the sample mean and later ones above it, which induces a negative sample ACF when the lag is large.

Best Answer

Let $x = (x_1, x_2, \ldots, x_n)$ be the series. Set

$$y_t = x_t - \bar{x}.$$

These are the residuals with respect to the estimated mean $\bar{x} = \frac{1}{n}\sum_{t=1}^n x_t$ of the series.

For $k=0, 1, 2, \ldots, n-1$ the acf function is computing

$$\text{acf}(x)_k = \frac{\sum_{t=1}^{n-k} y_t y_{t+k}}{\sum_{t=1}^n y_t^2}.$$

Notice that as the lag $k$ grows, there are fewer and fewer terms in the numerator as well as a shift of the indexes in the product. The reduction in number of terms in the numerator essentially forces a decrease in the value as $k$ increases. Most time series analyses consider only lags $k$ much smaller than $n$ for which this effect is negligible.

In your example where $x = (0, 1, 2, 3, 4, 5)$, $y = (-5/2, -3/2, -1/2, 1/2, 3/2, 5/2)$ initially has negative values and then moves into positive territory. For lags $k \ge 3$, the products $y_ty_{t+k}$ pair each early negative value with a later positive value, so every term in the numerator is negative.
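The sign change can be checked term by term. The following is a minimal sketch for the six-point example; the variable names (`numer` and so on) are illustrative and not part of the original answer. It computes the numerator $\sum_t y_t y_{t+k}$ at every lag and divides by $\sum_t y_t^2$.

```r
x <- 0:5
n <- length(x)
y <- x - mean(x)              # -2.5 -1.5 -0.5  0.5  1.5  2.5

# Numerator of the ACF at each lag k: sum of the products y[t] * y[t + k]
numer <- sapply(0:(n - 1), function(k) sum(y[1:(n - k)] * y[(1 + k):n]))
names(numer) <- 0:(n - 1)

numer                          # 17.50  8.75  1.00 -4.75 -7.50 -6.25
round(numer / sum(y^2), 3)     # 1.000  0.500  0.057 -0.271 -0.429 -0.357
```

The numerator already turns negative at lag 3, and the ratios reproduce the values printed by `acf` in the question.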

Edit: Intuitive Explanation

Intuitively, $\text{acf}(x)_k$ is supposed to be telling us the correlation between a series and its lag-$k$ version. The motivation for the question is that a series like $(0, 1, \ldots, n-1)$ is perfectly correlated with all its lags for $k=0$ right through $k=n-2$. How, then, can the ACF plot produce near zero and even negative values?
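To make the puzzle concrete, here is a small sketch (variable names are illustrative) that puts the ordinary Pearson correlation between each prefix and suffix of the example series next to the value reported by `acf` at the same lag.

```r
x <- 0:5
n <- length(x)

# Pearson correlation between the prefix x[1..n-k] and the suffix x[k+1..n]
rho <- sapply(0:(n - 2), function(k) cor(x[1:(n - k)], x[(1 + k):n]))

# Sample ACF at the same lags, as reported by acf()
r <- acf(x, lag.max = n - 2, plot = FALSE)$acf[, 1, 1]

rbind(lag = 0:(n - 2), rho = rho, acf = round(r, 3))
```

The `rho` row is 1 at every lag, while the `acf` row falls from 1 to negative values.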

There are two factors in play here. They can be seen by comparing the ACF formula to that of the usual correlation coefficient. For two series $(u_t)$ and $(w_t)$ of the same length $n-k$, let $\upsilon_t = u_t - \bar{u}$ and $\omega_t = w_t - \bar{w}$ be their residuals. (In the ensuing discussion, $(u_t)$ will be the prefix $(x_1, x_2, \ldots, x_{n-k})$ and $(w_t)$ will be the suffix $(x_{k+1}, x_{k+2}, \ldots, x_n)$.) By definition, their correlation coefficient is the average product of the standardized residuals,

$$\rho(u, w) = \frac{\sum_{t=1}^{n-k} \upsilon_t \omega_t}{\sqrt{\sum_{t=1}^{n-k} \upsilon_t^2 \sum_{t=1}^{n-k} \omega_t^2}}.$$

(The constants $\frac{1}{n-k}$ that usually appear in formulas for averages cancel in this ratio, so I have omitted them.)

When we are dealing with a single series $(x_t)$ of length $n$ and its (short) lags $k$, both $\upsilon_t$ and $\omega_t$ are essentially the same, apart from the shift of $k$ in their indexes: the first consists of the $(y_t)$ for $t$ from $1$ through $n-k$ (the high-$t$ end has been trimmed off) while the second consists of the same $(y_t)$ for $t$ from $k+1$ through $n$ (the low-$t$ end has been removed). If we ignore these slight differences, the denominator of $\rho(u, w)$ simplifies to

$$\sqrt{\sum_{t=1}^{n-k} \upsilon_t^2 \sum_{t=1}^{n-k} \omega_t^2} = \sqrt{\sum_{t=1}^{n-k} y_t^2 \sum_{t=1}^{n-k} y_{t+k}^2} \approx \sqrt{\sum_{t=1}^{n} y_t^2 \sum_{t=1}^n y_{t}^2} = \sqrt{\left(\sum_{t=1}^{n} y_t^2\right)^2 } = \sum_{t=1}^{n} y_t^2.$$

In making this approximation I have inserted the first $k$ terms $y_1^2 + \cdots + y_k^2$ into the sum for the suffix ($\omega_t$) and the last $k$ terms $y_{n-k+1}^2 + \cdots + y_{n}^2$ into the sum for the prefix ($\upsilon_t$). Because these are both sums of squares, they cannot decrease the denominator, and usually increase it a little bit. Accordingly, we see that using $\sum_{t=1}^n y_t^2$ in the denominator decreases the apparent correlation $\rho(u, w)$. The greater the lag $k$, the more the denominator will tend to increase, so this factor tends to reduce the high-lag values of the ACF no matter what.
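The size of this first effect can be seen numerically. The sketch below is only an illustration with made-up helper names (`ss`, `rho_denom`, `acf_denom`). Unlike the approximation above, it uses the exact denominator of $\rho$, with each segment centered on its own mean, and compares it to the fixed ACF denominator $\sum_{t=1}^n y_t^2$.

```r
x <- 0:5
n <- length(x)
y <- x - mean(x)

ss <- function(v) sum((v - mean(v))^2)   # sum of squared residuals about v's own mean

# Denominator of rho at each lag: sqrt(SS(prefix) * SS(suffix))
rho_denom <- sapply(0:(n - 2), function(k) sqrt(ss(x[1:(n - k)]) * ss(x[(1 + k):n])))
acf_denom <- sum(y^2)                    # the same at every lag

rbind(lag = 0:(n - 2), rho_denom = rho_denom, acf_denom = acf_denom)
```

For this series the $\rho$ denominator shrinks quickly with the lag (17.5, 10, 5, 2, 0.5) while the ACF denominator stays at 17.5.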

The second factor has to do with the difference between the mean of the entire series $\bar{x}$ and the means of the prefix $\bar{u} = \frac{1}{n-k}\sum_{t=1}^{n-k} x_t$ and suffix $\bar{w} = \frac{1}{n-k}\sum_{t=k+1}^n x_t$. The ACF formula uses the former whereas the correlation coefficient formula uses the latter. We can work out the change in the numerator by comparing the ACF and correlation coefficient formulas, working algebraically to make the ACF numerator look like the $\rho$ numerator:

$$\eqalign{ \sum_{t=1}^{n-k} y_t y_{t+k} &= &\sum_{t=1}^{n-k} (x_t-\bar{x})(x_{t+k}-\bar{x}) \\ &= &\sum_{t=1}^{n-k} (x_t-\bar u + \bar u - \bar{x})(x_{t+k}-\bar w + \bar w - \bar{x}) \\ &= &\sum_{t=1}^{n-k} \left((x_t-\bar u)(x_{t+k}-\bar w) + (\bar u - \bar{x})(\bar w - \bar{x})\right) \\ &= &\left(\sum_{t=1}^{n-k} \upsilon_t \omega_t\right) + (n-k)(\bar u - \bar{x})(\bar w - \bar{x}). }$$

(The cross terms disappeared after the second line for the usual reason: they sum to zero.)

Comparing to the formula for $\rho$, we see that the discrepancy in numerators depends on the lag (in terms of $n-k$) and the products of the changes in the means, $\bar u - \bar{x}$ and $\bar w - \bar{x}$. For a stationary series and large $k$ those changes ought to be small; for small $k$ we hope they will be small but perhaps not. In the example, for instance, at lag $k=1$ the mean after dropping off the last term decreases by $1/2$ and the mean after dropping off the first term similarly increases by $1/2$. The product

$$(n-k)(\bar u - \bar{x})(\bar w - \bar{x}) = (6-1)(-1/2)(1/2) = -5/4$$

decreases the numerator in the ACF compared to the numerator in $\rho$.
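The decomposition of the numerator can be verified for every lag of the example, not just $k=1$. This is a sketch with illustrative column names (`acf_num`, `rho_num`, `shift`), where `shift` stands for the term $(n-k)(\bar u - \bar{x})(\bar w - \bar{x})$.

```r
x <- 0:5
n <- length(x)
xbar <- mean(x)

check <- t(sapply(1:(n - 2), function(k) {
  u <- x[1:(n - k)]            # prefix
  w <- x[(1 + k):n]            # suffix
  acf_num <- sum((u - xbar) * (w - xbar))                   # ACF numerator
  rho_num <- sum((u - mean(u)) * (w - mean(w)))             # numerator of rho
  shift   <- (n - k) * (mean(u) - xbar) * (mean(w) - xbar)  # mean-shift term
  c(lag = k, acf_num = acf_num, rho_num = rho_num, shift = shift)
}))
check
```

In each row `acf_num` equals `rho_num + shift`. At lag 1 the shift is $-5/4$, as computed above, and it grows more negative as the lag increases.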

The net effect of these two factors in the example is that both conspire to decrease the apparent correlation: the denominator goes up, because it includes a few more positive terms overall, and the numerator goes down, because one end of the series tends to be less than the average and the other end tends to be greater than the average. (That's more or less what a "long-term trend" means, suggesting there is some evidence of nonstationarity in this series.)

To illustrate the formula for the ACF, here is direct (but less efficient) R code to compute acf:

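acf.0 <- function(x) {
  n <- length(x)
  y <- x - mean(x)  # residuals about the mean of the whole series
  # For each lag k = 0, ..., n-1, sum the products y[t] * y[t+k], then normalize
  sapply(0:(n - 1), function(k) sum(y[1:(n - k)] * y[(1 + k):n])) / sum(y * y)
}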

As a test, compare the two results:

> sum((acf.0(0:5) - acf(0:5, plot=FALSE)$acf)^2)
[1] 6.162976e-33

The answers agree to within double precision floating point roundoff error.

Related Solutions

Solved – Ljung-Box Statistics for ARIMA residuals in R: confusing test results

You've interpreted the test wrong. The null hypothesis of the Ljung-Box test is that the residuals are independent. If the p-value is greater than 0.05, we cannot reject that hypothesis, which is what we want if the model is correct. If you simulate a white noise time series using the code below and apply the same test to it, the p-value will usually be greater than 0.05.

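set.seed(1)  # arbitrary seed, for reproducibility
# arima.sim() needs a model list; order c(0, 0, 0) (no AR or MA terms) is white noise
w <- arima.sim(model = list(order = c(0, 0, 0)), n = 120)
plot(w)
Box.test(w, type = "Ljung-Box")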
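For contrast, it may help to run the same test on a series that is clearly not white noise. The sketch below is an illustration only: the seed, the series length of 500 and the AR coefficient 0.9 are arbitrary assumptions, not values from the original question.

```r
set.seed(42)                                        # arbitrary seed, for reproducibility

wn  <- ts(rnorm(500))                               # white noise
ar1 <- arima.sim(model = list(ar = 0.9), n = 500)   # strongly autocorrelated AR(1)

# Box.test() returns an "htest" object; p.value holds the p-value
p_wn  <- Box.test(wn,  type = "Ljung-Box")$p.value
p_ar1 <- Box.test(ar1, type = "Ljung-Box")$p.value

c(white_noise = p_wn, ar1 = p_ar1)
```

The AR(1) series should give a p-value that is essentially zero, so independence is rejected. The white noise series will typically give a large p-value, although by construction about 5% of white noise samples fall below 0.05.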

Solved – Intuition behind cross-correlation function interpretation vs. correlation of lagged time series

The problem is not the normalisation constant, since in the correlation formula it simply cancels out. The difference arises because the means and variances of the series are held fixed when calculating the cross-correlations. That is, the means and variances are calculated once for the whole series, and those same values are used when the overlapping part of the series gets shorter due to lagging. This is a perfectly valid operation if the series are considered stationary, i.e. with constant mean and variance.

Here is a detailed example which recreates the behaviour of ccf:

x = c(1,2,3,4,5,6,7,8,9,10)
y = c(3,3,3,5,5,5,5,7,7,11)

mx <- mean(x)
my <- mean(y)
dx <- mean((x-mx)^2)
dy <- mean((y-my)^2)
nx <- length(x)  

round(cor(x,y),3)
[1] 0.896

cr<-function(x,y,mux=mean(x),muy=mean(y),dx=var(x),dy=var(y),n=length(x)) {
    cxy<-sum((x-mux)*(y-muy))/n
    cxy/sqrt(dx*dy)
}
round(cr(x,y,mx,my,dx,dy,nx),3)
[1] 0.896

# Think "Lag -1"  
# x[-10] = 1,2,3,4,5,6,7,8,9
# y[-1] = 3,3,5,5,5,5,7,7,11
round(cor(x[-10],y[-1]),3)
[1] 0.894
round(cr(x[-10],y[-1],mx,my,dx,dy,nx),3)
[1] 0.699
# Think "Lag -2" 
# x[-10:-9] = 1,2,3,4,5,6,7,8
# y[-1:-2] = 3,5,5,5,5,7,7,11
round(cor(x[-10:-9],y[-1:-2]),3)
[1] 0.878
round(cr(x[-10:-9],y[-1:-2],mx,my,dx,dy,nx),3)
[1] 0.466

print(ccf(x,y,lag.max=3,plot=FALSE))

Autocorrelations of series ‘X’, by lag

    -3     -2     -1      0      1      2      3 
 0.197  0.466  0.699  0.896  0.436  0.221 -0.018

Note that the norming constant in the function cr is needed only because it must be the same norming constant used in the variance calculations.
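The lag-by-lag calls above can be folded into one function. This is a sketch under the lag convention visible in the output above (lag $k$ pairs $x_{t+k}$ with $y_t$); `manual_ccf` is a hypothetical helper name, not a function from any package. Because the same factor $1/n$ appears in the covariance and in both variances, it cancels, so plain sums are used.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(3, 3, 3, 5, 5, 5, 5, 7, 7, 11)

manual_ccf <- function(x, y, lag.max) {
  n  <- length(x)
  dx <- x - mean(x)                        # deviations from the full-series means
  dy <- y - mean(y)
  denom <- sqrt(sum(dx^2) * sum(dy^2))     # fixed at every lag
  sapply(-lag.max:lag.max, function(k) {
    if (k >= 0) sum(dx[(1 + k):n] * dy[1:(n - k)]) / denom
    else        sum(dx[1:(n + k)] * dy[(1 - k):n]) / denom
  })
}

manual <- manual_ccf(x, y, lag.max = 3)
round(manual, 3)

# Compare with the built-in estimate
builtin <- as.vector(ccf(x, y, lag.max = 3, plot = FALSE)$acf)
all.equal(manual, builtin)
```

The only difference from `cor` on the trimmed series is that the means and the denominator are computed once from the full series instead of being recomputed for each overlap.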
