What to do if ACF or PACF show significant higher lags

autocorrelation, autoregressive, lags, moving average, time series

I have monthly climate data for 90 years.
I assembled the best model I could (adding sensible parameters to minimize AIC), and then tried various ARMA correlation structures (using gls in the nlme package in R) to reduce the significant autocorrelation at small lags (<30). I then selected the ARMA model with the lowest AIC as my best model.

However, based on the ACF and PACF plots, there are still significant larger-interval lags (>30).

[Figure: ACF and PACF plots (ACF_PACF_GDD_NC)]

My questions are:

  1. How should I react to that? Do I consider them to be important or spurious?

    • I initially assumed that if lag 60 (corresponding to 5 years) was significant, then this would indicate that there is a 5-year trend in my data. However, I thought I'd heard before that ACF/PACF is not a good way to approach long-term lags.
  2. What do I do with this? How would I go about reducing the larger lags?

    • For example, is there a specific ARMA p/q combo that 'best' reduces larger lags? Or should I try adding sin/cos variables in my model? Or some other approach?

    • Again, if the ACF/PACF are not good for IDing large lags, how would I determine the 'real' long-term cyclic patterns to actually account for?

Best Answer

[I believe this is a duplicate - and while I can find questions with this issue explained in comments, the couple that explain it correctly and fully in answers aren't really answering the same question. There probably is a good duplicate somewhere but since I couldn't locate one -- this will serve as an answer in the meantime.]

If you choose your cut-off for significance at each lag to be a 95% interval (so you conclude the ACF or PACF at a given lag is non-zero if the sample value is larger in magnitude than the boundary of the 95% interval), then even when every population ACF or PACF value is zero, you'd expect to see about 5% of your sample values outside the bounds. (If your sample ACF or PACF values at the different lags were independent of each other, the number outside would be $\text{Binomial}(l, 0.05)$, where $l$ is the number of different lags considered.)

So even if nothing were going on, you should see about 5% of the values somewhat outside the bounds. For example, with 120 lags you expect $0.05 \times 120 = 6$ to be outside the bounds even when no population autocorrelation is non-zero, but of course random variation means you may see a bit more or a bit fewer than 6.
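To make this concrete, here is a minimal simulation sketch in Python (not the R/nlme setup from the question). It assumes Gaussian white noise of length 1080 (90 years of monthly values), 120 lags, and the usual approximate band of $\pm 1.96/\sqrt{n}$; the function names, the seed, and the number of simulations are illustrative choices, not anything from the original post.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag (the usual biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def count_outside(n, max_lag, rng):
    """Count lags whose sample ACF falls outside the +/- 1.96/sqrt(n) band."""
    x = rng.standard_normal(n)  # white noise: every population autocorrelation is zero
    bound = 1.96 / np.sqrt(n)   # approximate 95% band (assumption: large-sample normal approx.)
    return int(np.sum(np.abs(sample_acf(x, max_lag)) > bound))

rng = np.random.default_rng(0)            # seed chosen arbitrarily for reproducibility
n_obs, max_lag, n_sims = 1080, 120, 200   # assumed sizes, matching the example in the text
counts = np.array([count_outside(n_obs, max_lag, rng) for _ in range(n_sims)])

print("mean number of lags outside the band:", counts.mean())
print("smallest and largest count seen:", counts.min(), counts.max())
```

Under these assumptions the average count should land in the neighbourhood of the nominal 6, with individual series scattering around it, even though the simulated data contain no autocorrelation at all.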

Indeed, getting none of the values outside those 95% lines would be suspicious - you shouldn't see that!
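How suspicious would it be? A rough calculation, sketched in Python under the simplifying assumption that the 120 lags behave like independent trials with a 5% exceedance probability each (real sample autocorrelations are only approximately independent, so treat the numbers as a guide). The helper `binom_pmf` is just an illustrative name.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_lags, p = 120, 0.05  # assumed: 120 lags, 95% bounds, independence across lags

p_none = binom_pmf(0, n_lags, p)                                 # no lag outside the bounds
p_3_to_9 = sum(binom_pmf(k, n_lags, p) for k in range(3, 10))    # between 3 and 9 outside

print(f"expected number outside: {n_lags * p:.1f}")
print(f"P(none outside)        : {p_none:.4f}")
print(f"P(3 to 9 outside)      : {p_3_to_9:.3f}")
```

With these inputs, the probability of seeing no values at all outside the bounds is $0.95^{120} \approx 0.002$, so a completely clean plot over 120 lags would itself be an unusual outcome.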

So there's no reason on the basis of what we see in those displays to think that there's anything going on -- the displays are quite consistent with independence in the series you computed the ACF of.