I have a dataset of water temperature measurements taken from a large waterbody at irregular intervals over a period of decades. (Galveston Bay, TX if you’re interested)
Here’s the head of the data:
STATION_ID DATE TIME LATITUDE LONGITUDE YEAR MONTH DAY SEASON MEASUREMENT
1 13296 6/20/91 11:04 29.50889 -94.75806 1991 6 20 Summer 28.0
2 13296 3/17/92 9:30 29.50889 -94.75806 1992 3 17 Spring 20.1
3 13296 9/23/91 11:24 29.50889 -94.75806 1991 9 23 Fall 26.0
4 13296 9/23/91 11:24 29.50889 -94.75806 1991 9 23 Fall 26.0
5 13296 6/20/91 11:04 29.50889 -94.75806 1991 6 20 Summer 28.0
6 13296 12/17/91 10:15 29.50889 -94.75806 1991 12 17 Winter 13.0
(MEASUREMENT is the temperature measurement of interest.)
The full set is available here: https://github.com/jscarlton/galvBayData/blob/master/gbtemp.csv
I would like to remove the effects of seasonal variation to observe the trend (if any) in the temperature over time. Is a time series decomposition the best way to do this? How do I handle the fact that the measurements were not taken at a regular interval? I'm hoping there is an R package for this type of analysis, though Python or Stata would be fine, too.
(Note: for this analysis, I’m choosing to ignore the spatial variability in the measurements. Ideally, I’d account for that as well, but I think that doing so would be hopelessly complex.)
Best Answer
Rather than try to decompose the time series explicitly, I would instead suggest that you model the data spatio-temporally because, as you'll see below, the long-term trend likely varies spatially, and the seasonal pattern varies both with the long-term trend and spatially.
I have found that generalised additive models (GAMs) are a good model for fitting irregular time series such as you describe.
Below I illustrate a quick model I prepared for the full data of the following form
\begin{align} \begin{split} \mathrm{E}(y_i) & = \alpha + f_1(\text{ToD}_i) + f_2(\text{DoY}_i) + f_3(\text{Year}_i) + f_4(\text{x}_i, \text{y}_i) + \\ & \quad f_5(\text{DoY}_i, \text{Year}_i) + f_6(\text{x}_i, \text{y}_i, \text{ToD}_i) + \\ & \quad f_7(\text{x}_i, \text{y}_i, \text{DoY}_i) + f_8(\text{x}_i, \text{y}_i, \text{Year}_i) \end{split} \end{align}
where $\text{ToD}_i$ is the time of day, $\text{DoY}_i$ the day of year, $\text{Year}_i$ the year, and $(\text{x}_i, \text{y}_i)$ the spatial location (longitude and latitude) of the $i$th observation.

Effectively, the first four smooths are the main effects of

1. time of day,
2. day of year (the seasonal, within-year effect),
3. year (the long-term trend), and
4. spatial location,

whilst the remaining four tensor product smooths model smooth interactions between the stated covariates, which model

1. how the seasonal pattern varies over the years,
2. how the time of day effect varies spatially,
3. how the seasonal effect varies spatially, and
4. how the long-term trend varies spatially.
The data are loaded into R and massaged a bit with the following code
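A sketch of that preparation and of the model fit follows; I read the raw CSV from the linked repository, and the exact basis dimensions (`k`), basis types, and object names are illustrative assumptions rather than definitive choices:

```r
library("mgcv")
library("readr")
library("dplyr")

galveston <- read_csv("https://github.com/jscarlton/galvBayData/raw/master/gbtemp.csv") %>%
    mutate(datetime   = as.POSIXct(paste(DATE, TIME),
                                   format = "%m/%d/%y %H:%M", tz = "CDT"),
           STATION_ID = factor(STATION_ID),
           DoY = as.numeric(format(datetime, format = "%j")),          # day of year
           ToD = as.numeric(format(datetime, format = "%H")) +         # time of day
                 as.numeric(format(datetime, format = "%M")) / 60)

## endpoints for the cyclic day-of-year smooth; 0.5 and 366.5 make
## Dec 31st join up smoothly with Jan 1st, allowing for leap years
knots <- list(DoY = c(0.5, 366.5))

m <- bam(MEASUREMENT ~
             s(ToD, k = 10) +                                          # time of day
             s(DoY, k = 12, bs = "cc") +                               # season (cyclic)
             s(YEAR, k = 30) +                                         # long-term trend
             s(LONGITUDE, LATITUDE, k = 100, bs = "ds", m = c(1, 0.5)) + # space
             ti(DoY, YEAR, bs = c("cc", "tp"), k = c(12, 15)) +        # season x trend
             ti(LONGITUDE, LATITUDE, ToD, d = c(2, 1), bs = c("ds", "tp"),
                m = list(c(1, 0.5), NA), k = c(20, 10)) +
             ti(LONGITUDE, LATITUDE, DoY, d = c(2, 1), bs = c("ds", "cc"),
                m = list(c(1, 0.5), NA), k = c(25, 12)) +
             ti(LONGITUDE, LATITUDE, YEAR, d = c(2, 1), bs = c("ds", "tp"),
                m = list(c(1, 0.5), NA), k = c(25, 15)),
         data = galveston, method = "fREML", knots = knots,
         nthreads = 4, discrete = TRUE)
```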
The model itself is fitted using the `bam()` function, which is designed for fitting GAMs to larger data sets such as this. You can use `gam()` for this model also, but it will take somewhat longer to fit.

The `s()` terms are the main effects, whilst the `ti()` terms are tensor product interaction smooths in which the main effects of the named covariates have been removed from the basis. These `ti()` smooths are a way to include interactions of the stated variables in a numerically stable way.

The `knots` object just sets the endpoints of the cyclic smooth used for the day-of-year effect: we want 23:59 on Dec 31st to join up smoothly with 00:01 on Jan 1st, and the chosen endpoints account to some extent for leap years.

The model summary indicates all these effects are significant.
A more careful analysis would want to check whether we need all these interactions; some of the spatial `ti()` terms explain only small amounts of variation in the data, as indicated by the $F$ statistic. There's a lot of data here, so even small effect sizes may be statistically significant but uninteresting.

As a quick check, however, removing the three spatial `ti()` smooths (`m.sub`) results in a significantly poorer fit as assessed by AIC.

We can plot the partial effects of the first five smooths using the `plot()` method; the 3D tensor product smooths can't easily be plotted, and are not plotted by default. The `scale = 0` argument there puts all the plots on their own scale; to compare the magnitudes of the effects, we can turn this off.

Now we can see that the seasonal effect dominates. The long-term trend (on average) is shown in the upper-right plot. To really look at the long-term trend, however, you need to pick a station and then predict from the model for that station, fixing time of day and day of year to some representative values (midday, and a day of the year in summer, say). The early year or two of the series has some low temperature values relative to the rest of the records, which is likely being picked up in all the smooths involving `YEAR`. These data should be looked at more closely.

This isn't really the place to get into that, but here are a couple of visualisations of the model fits. First I look at the spatial pattern of temperature and how it varies over the years of the series. To do that I predict from the model over a 100 × 100 grid covering the spatial domain, at midday on day 180 of each year:
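A sketch of that prediction grid (the object names are my choices, and the fitted model and prepared data are assumed to be `m` and `galveston` as above):

```r
## 100 x 100 grid over the spatial domain, at midday on day 180 of each year
pdata <- with(galveston,
              expand.grid(ToD = 12,
                          DoY = 180,
                          YEAR = seq(min(YEAR), max(YEAR)),
                          LONGITUDE = seq(min(LONGITUDE), max(LONGITUDE),
                                          length.out = 100),
                          LATITUDE  = seq(min(LATITUDE),  max(LATITUDE),
                                          length.out = 100)))

## predicted temperature at each grid point in each year
fit <- predict(m, newdata = pdata)
```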
Then I set to missing (`NA`) the predicted values `fit` for all grid points that lie more than some distance from the observations (a proportional distance; `dist`), and join the predictions to the prediction data.
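This step can be sketched with mgcv's `exclude.too.far()`, which flags grid points lying further than a proportion `dist` of the plot region from any observation; the `dist = 0.1` threshold and the names `pdata`, `fit`, and `pred` are my assumptions:

```r
library("mgcv")

## TRUE for grid points more than 10% of the plot region from any observation
too.far <- exclude.too.far(pdata$LONGITUDE, pdata$LATITUDE,
                           galveston$LONGITUDE, galveston$LATITUDE,
                           dist = 0.1)
fit[too.far] <- NA

## join the masked predictions to the prediction grid
pred <- cbind(pdata, Fitted = fit)
```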
Setting predicted values to `NA` like this stops us extrapolating beyond the support of the data. Using ggplot2, we obtain the following plot of the predicted spatial pattern of temperature, faceted by year.
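For example (assuming the predictions are in `pred` with a `Fitted` column; the palette and layout choices here are mine):

```r
library("ggplot2")

ggplot(pred, aes(x = LONGITUDE, y = LATITUDE)) +
    geom_raster(aes(fill = Fitted)) +                 # predicted temperature
    facet_wrap(~ YEAR, ncol = 12) +                   # one panel per year
    scale_fill_viridis_c(name = expression(degree * C), option = "plasma",
                         na.value = "transparent") +  # masked cells stay blank
    coord_quickmap() +
    theme(legend.position = "top")
```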
We can see the year-to-year variation in temperatures in a bit more detail if we animate rather than facet the plot
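One way to animate, using the gganimate package (a sketch; `pred` as before, and the frame settings are my assumptions):

```r
library("ggplot2")
library("gganimate")

anim <- ggplot(pred, aes(x = LONGITUDE, y = LATITUDE)) +
    geom_raster(aes(fill = Fitted)) +
    scale_fill_viridis_c(name = expression(degree * C), option = "plasma",
                         na.value = "transparent") +
    coord_quickmap() +
    transition_time(YEAR) +                # one frame per year
    ggtitle("Year: {frame_time}")

animate(anim, nframes = length(unique(pred$YEAR)))
```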
To look at the long-term trends in more detail, we can predict for particular stations. For example, for `STATION_ID` 13364, predicting for days in the four quarters, we might use the following to prepare values of the covariates we want to predict at: midday, on day of year 1, 90, 180, and 270, at the selected station, with the long-term trend evaluated at 500 equally spaced values.

Then we predict and ask for standard errors, to form an approximate pointwise 95% confidence interval,
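A sketch of those two steps (the station coordinates are looked up from the data rather than hard-coded; the object names and the assumed `m` and `galveston` objects are my choices):

```r
## covariates for STATION_ID 13364: midday, four days of year,
## long-term trend evaluated at 500 equally spaced values
stn <- subset(galveston, STATION_ID == "13364")
pdata2 <- with(galveston,
               expand.grid(ToD = 12,
                           DoY = c(1, 90, 180, 270),
                           YEAR = seq(min(YEAR), max(YEAR), length.out = 500),
                           LONGITUDE = unique(stn$LONGITUDE)[1],
                           LATITUDE  = unique(stn$LATITUDE)[1]))

## predict with standard errors and form an approximate pointwise
## 95% confidence interval
fit  <- predict(m, newdata = pdata2, se.fit = TRUE)
crit <- qnorm((1 - 0.95) / 2, lower.tail = FALSE)   # approx. 1.96
pred2 <- transform(pdata2,
                   Fitted = fit$fit,
                   Lower  = fit$fit - crit * fit$se.fit,
                   Upper  = fit$fit + crit * fit$se.fit)
```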
which we plot
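e.g. (assuming a data frame of predictions `pred2` with columns `YEAR`, `DoY`, `Fitted`, `Lower`, and `Upper`; those names are my assumption):

```r
library("ggplot2")

ggplot(pred2, aes(x = YEAR, y = Fitted)) +
    geom_ribbon(aes(ymin = Lower, ymax = Upper), fill = "grey80") +  # 95% CI
    geom_line() +                                                    # estimated trend
    facet_wrap(~ DoY, scales = "free_y") +                           # one panel per quarter
    labs(x = NULL, y = expression(Temperature ~ (degree * C)))
```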
producing a figure of the estimated long-term trend within each quarter, with its approximate confidence band.
Obviously, there's a lot more to modelling these data than what I show here, and we'd want to check for residual autocorrelation and overfitting of the splines, but approaching the problem as one of modelling the features of the data allows for a more detailed examination of the trends.
You could of course just model each `STATION_ID` separately, but that would throw away data, and many stations have few observations. Here the model borrows strength from all the stations to fill in the gaps and assist in estimating the trends of interest.

Some notes on `bam()`

The `bam()` model is using all of mgcv's tricks to estimate the model quickly: multiple threads (`nthreads = 4`), fast REML smoothness selection (`method = 'fREML'`), and discretization of the covariates (`discrete = TRUE`). With these options turned on, the model fits in less than a minute on my 2013-era dual 4-core Xeon workstation with 64 GB of RAM.