Autocorrelation Time – Definition for Effective Sample Size

correlation, time series

I've found two definitions in the literature for the autocorrelation time of a weakly stationary time series:

$$
\tau_a = 1+2\sum_{k=1}^\infty \rho_k \quad \text{versus} \quad \tau_b = 1+2\sum_{k=1}^\infty \left|\rho_k\right|
$$

where $\rho_k = \frac{\text{Cov}[X_t,X_{t+k}]}{\text{Var}[X_t]}$ is the autocorrelation at lag $k$.

One application of the autocorrelation time is to find the "effective sample size": if you have $n$ observations of a time series, and you know its autocorrelation time $\tau$, then you can pretend that you have

$$
n_\text{eff} = \frac{n}{\tau}
$$

independent samples instead of $n$ correlated ones for the purposes of finding the mean. Estimating $\tau$ from data is non-trivial, but there are a few ways of doing it (see Thompson 2010).
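For concreteness, here is a rough sketch of one way to estimate $\tau_a$ from data: sum the sample autocorrelations up to a cutoff lag. The cutoff used below (the square root of the series length) is an arbitrary heuristic chosen for illustration, not one of the methods from Thompson 2010.

# Sketch: estimate tau_a by truncating the sum of sample autocorrelations.
# The cutoff max.lag is an ad hoc choice, purely for illustration.
estimate.tau.a <- function(x, max.lag = floor(sqrt(length(x)))) {
    rho <- acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]   # drop lag 0
    1 + 2 * sum(rho)
}
estimate.n.eff <- function(x) length(x) / estimate.tau.a(x)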

The definition without absolute values, $\tau_a$, seems more common in the literature; but it admits the possibility of $\tau_a<1$. Using R and the "coda" package:

require(coda)
ts.uncorr <- arima.sim(model=list(), n=10000)         # white noise
ts.corr   <- arima.sim(model=list(ar=-0.5), n=10000)  # AR(1) with negative coefficient
effectiveSize(ts.uncorr)                              # sanity check
    # result should be close to 10000
effectiveSize(ts.corr)
    # result is in the neighborhood of 30000... ???

The "effectiveSize" function in "coda" uses a definition of the autocorrelation time equivalent to $\tau_a$, above. There are some other R packages out there that compute effective sample size or autocorrelation time, and all the ones I've tried give results consistent with this: that an AR(1) process with a negative AR coefficient has more effective samples than the correlated time series. This seems strange.

Obviously, this can never happen under the $\tau_b$ definition of autocorrelation time, since $\tau_b \geq 1$ by construction.

What is the correct definition of autocorrelation time? Is there something wrong with my understanding of effective sample sizes? The $n_\text{eff} > n$ result shown above seems like it must be wrong… what's going on?

Best Answer

First, the appropriate definition of "effective sample size" is IMO linked to a quite specific question. If $X_1, X_2, \ldots$ are identically distributed with mean $\mu$ and variance 1, the empirical mean

$$
\hat{\mu} = \frac{1}{n} \sum_{k=1}^n X_k
$$

is an unbiased estimator of $\mu$. But what about its variance? For independent variables the variance is $n^{-1}$. For a weakly stationary time series, the variance of $\hat{\mu}$ is

$$
\frac{1}{n^2} \sum_{k, l=1}^n \text{Cov}[X_k, X_l] = \frac{1}{n}\left(1 + 2\left(\frac{n-1}{n} \rho_1 + \frac{n-2}{n} \rho_2 + \ldots + \frac{1}{n} \rho_{n-1}\right) \right) \simeq \frac{\tau_a}{n},
$$

where the approximation is valid for large enough $n$. If we define $n_{\text{eff}} = n/\tau_a$, the variance of the empirical mean of a weakly stationary time series is approximately $n_{\text{eff}}^{-1}$, which is the same variance as if we had $n_{\text{eff}}$ independent samples. Thus $n_{\text{eff}} = n/\tau_a$ is an appropriate definition if we ask for the variance of the empirical average. It might be inappropriate for other purposes.
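A small simulation sketch (seed, series length, and replicate count are arbitrary) makes this concrete. The ratio $\text{Var}[X_t]/\text{Var}[\hat{\mu}]$ is what $n_\text{eff}$ measures (the argument above takes $\text{Var}[X_t]=1$; the simulated AR(1) has marginal variance $1/(1-\phi^2)$, hence the ratio), and for $\phi = -0.5$ it lands near $3n$ rather than near $n$:

# Monte Carlo check of n_eff = Var(X_t) / Var(mean) for AR(1) with phi = -0.5.
set.seed(1)                                  # arbitrary seed
n <- 1000; reps <- 5000; phi <- -0.5
means <- replicate(reps, mean(arima.sim(model = list(ar = phi), n = n)))
marg.var <- 1 / (1 - phi^2)                  # marginal variance of the AR(1) process
marg.var / var(means)                        # empirical n_eff, roughly 3000
n * (1 - phi) / (1 + phi)                    # theoretical n/tau_a = 3000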

With a negative correlation between observations, it is certainly possible for the variance to become smaller than $n^{-1}$ ($n_{\text{eff}} > n$). This is a well-known variance reduction technique in Monte Carlo integration: if we introduce negative correlation between the variables instead of correlation 0, we can reduce the variance without increasing the sample size.
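A minimal sketch of that idea (antithetic variates, applied here to the unrelated toy problem of estimating $\int_0^1 e^u \, du$): pairing each uniform draw $U$ with $1-U$ makes the two function evaluations negatively correlated, which shrinks the variance of their average at no extra sampling cost.

# Antithetic variates sketch: estimate the integral of exp(u) on [0,1], true value e - 1.
set.seed(1)                                  # arbitrary seed
u1 <- runif(2500); u2 <- runif(2500)
indep.pairs <- (exp(u1) + exp(u2)) / 2       # two independent draws per pair
anti.pairs  <- (exp(u1) + exp(1 - u1)) / 2   # negatively correlated pair, same cost
var(indep.pairs); var(anti.pairs)            # antithetic pairing has much smaller variance
mean(anti.pairs)                             # still estimates e - 1, about 1.718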
