Time Series – Necessity of Correlation Structures in Seasonal GAM

generalized-additive-modelmgcvtime series

I am currently modellingusing the 'mgcv' package in R. My response variable is called log.tr, representing the log of residence time. My data looks a little bit like this:

set.seed(123)
logtr <- seq(0.67, 4.29, length.out = 100)
day_ <- sample(1:365, 100, replace = TRUE)  # Random day values
year_ <- sample(2000:2020, 100, replace = TRUE)  # Random year values
TEMPERATURE <- rnorm(100, mean = 25, sd = 5)  # Random temperature values
gam_tres_df <- data.frame(log.tr = logtr, day_ = day_, year_ = year_, TEMPERATURE = TEMPERATURE)

I am attempting to fit a generalized additive model (GAM) using the 'gam' function from 'mgcv'.

The model formula I am using is:

gam(log.tr ~ s(day_, k = 40, bs = 'cc') + s(year_, k = 12) + s(TEMPERATURE, k = 40) + s(day_, year_), method = 'REML', data = gam_tres_df)

In this formula, I have included smooth terms for the variables day_, year_, TEMPERATURE, and an interaction term between day_ and year_. I have chosen specific degrees of freedom (k) for each smooth term.

However, I suspect that the variables day_ and year_ may be dependant. I am considering adding a correlation structure to the model to account for this potential correlation. Specifically, I am thinking of using the corARMA function with the form ~ 1|Year and an autoregressive moving average (ARMA) model with a lag of 3 (p = 3).

My question is whether I should include this correlation structure (corARMA(form = ~ 1|Year, p = 3) or maybe (form = ~ day | year)) in my GAM model or if the current model specification without the correlation structure is appropriate. Also I wanted to know how to define this correlation structure for seasonal daily data; since i want to consider residual variation from year to year.

Please let me know if you need further information or have any additional questions!

Best Answer

  1. You're likely better off modelling the residence time itself, not it's log transform, using a non-Gaussian family, but that could well be tricky if you do actually need the autocorrelation structure.

  2. You can't use the correlation with gam(); you can only use this with gamm(), FYI; gam() will silently ignore this argument if you provide it.

  3. There shouldn't really be a relationship between day_ and year_, their effects are unlikely to be similar; one is a within-year effect, the other a between-year effect. There may be an interaction between these two variables, if say the seasonal pattern has changed over time due to say climate change. These variables are not dependent however.

  4. The correlation structure available via gamm() (provided by the nlme package) don't account for correlation or dependence among covariates. They account for un-modelled correlation among observations.

  5. The correlation structures available do not cover seasonal ARMA models, only ARMA models.

  6. corARMA(form = ~ 1|year_, p = 3) and corARMA(form = ~ day_ | year_), p = 3) should give the same results if the data are in time order within year_. Both will describe an AR(3) nested within year, using the same autocorrelation parameter $\rho$

  7. As mentioned in 4. you only need it if there is unmodelled temporal autocorrelation. With 40 df for the day_ term, you can model quite a complex seasonal effect so maybe you won't need the AR(3)? You might want to consider using te(day_, year_, ...) to allow the seasonal signal to vary with the trend, and then you might not need the autocorrelation.

  8. Whether you need the autocorrelation structure will depend on how wiggly you want the smooths of day_ and year_ to be; all else equal, you can model the autocorrelation with wiggly smooth functions and then you won't need the autocorrelation structure. But if you are trying to estimate the seasonal and between-year trends without this autocorrelation (so you want simpler, smoother functions) then it is likely that you'll need the autocorrelation structure if you have data recorded at sufficient time resolution to separate the larger scales of temporal variation (seasonal and long-term trend) from the short-scale variation of the autocorrelation.

  9. You can fit models with no AR(p), and AR(1), an AR(2), and then an AR(3) and use tools to select among these 4 models. I have some example code here showing how to do this (for models with AR(1) vs AR(2) for example): https://fromthebottomoftheheap.net/2016/03/25/additive-modeling-global-temperature-series-revisited/