Solved – Data Augmentation strategies for Time Series Forecasting

data augmentationtime series

I'm considering two strategies to do "data augmentation" on time-series forecasting.

First, a little bit of background. A predictor $P$ to forecast the next step of a time-series $\lbrace A_i\rbrace$ is a function that typically depends on two things, the time-series past states, but also the predictor's past states:

$$P(\lbrace A_{i\leq t-1}\rbrace,P_{S_{t-1}})$$

If we want to adjust/train our system to obtain a good $P$, then we'll need enough data. Sometimes available data won't be enough, so we consider doing data augmentation.

First approach

Suppose we have the time-series $\lbrace A_i \rbrace$, with $1 \leq i \leq n$. And suppose also that we have $\epsilon$ that meets the following condition: $0<\epsilon < |A_{i+1} – A_i| \forall i \in \lbrace 1, \ldots,n\rbrace$.

We can construct a new time series $\lbrace B_i = A_i+r_i\rbrace$, where $r_i$ is a realization of the distribution $N(0,\frac{\epsilon}{2}) $.

Then, instead of minimizing the loss function only over $\lbrace A_i \rbrace$, we do that also over $\lbrace B_i \rbrace$. So, if the optimization process takes $m$ steps, we have to "initialize" the predictor $2m$ times, and we'll compute approximately $2m(n-1)$ predictor internal states.

Second approach

We compute $\lbrace B_i \rbrace$ as before, but we don't update the predictor's internal state using $\lbrace B_i \rbrace$, but $\lbrace A_i \rbrace$. We only use the two series together at the time of computing the loss function, so we'll compute approximately $m(n-1)$ predictor internal states.

Of course, there is less computational work here (although the algorithm is a little bit uglier), but it does not matter for now.

The doubt

The problem is: from a statistical point of view, which is the the "best" option? And why?

My intuition tells me that the first one is better, because it helps to "regularize" the weights related with the internal state, while the second one only helps to regularize the weights related with the observed time-series' past.

Extra:

Any other ideas to do data augmentation for time series forecasting?
How to weight the synthetic data in the training set?

Best Answer

Any other ideas to do data augmentation for time series forecasting?

I'm currently thinking about the same problem. I've found the paper "Data Augmentation for Time Series Classification using Convolutional Neural Networks" by Le Guennec et al. which doesn't cover forecasting however. Still the augmentation methods mentioned there look promising. The authors communicate 2 methods:

Window Slicing (WS)

A first method that is inspired from the computer vision community [8,10] consists in extracting slices from time series and performing classification at the slice level. This method has been introduced for time series in [6]. At training, each slice extracted from a time series of class y is assigned the same class and a classifier is learned using the slices. The size of the slice is a parameter of this method. At test time, each slice from a test time series is classified using the learned classifier and a majority vote is performed to decide a predicted label. This method is referred to as window slicing (WS) in the following.

Window Warping (WW)

The last data augmentation technique we use is more time-series specific. It consists in warping a randomly selected slice of a time series by speeding it up or down, as shown in Fig. 2. The size of the original slice is a parameter of this method. Fig. 2 shows a time series from the “ECG200” dataset and corresponding transformed data. Note that this method generates input time series of different lengths. To deal with this issue, we perform window slicing on transformed time series for all to have equal length. In this paper, we only consider warping ratios equal to 0.5 or 2, but other ratios could be used and the optimal ratio could even be fine tuned through cross-validation on the training set. In the following, this method will be referred to as window warping (WW).

Fig. 2 from paper

The authors kept 90% of the series unchanged (i.e. WS was set to a 90% slice and for WW 10% of the series were warped). The methods are reported to reduce classification error on several types of (time) series data, except on 1D representations of image outlines. The authors took their data from here: http://timeseriesclassification.com

How to weight the synthetic data in the training set?

In image augmentation, since the augmentation isn't expected to change the class of an image, it's afaik common to weight it as any real data. Time series forecasting (and even time series classification) might be different:

A time series is not easily perceivable as a contiguous object for humans, so depending on how much you tamper with it, is it still the same class? If you only slice and warp a little and classes are visually distinct, this might not pose a problem for classification tasks
For forecasting, I would argue that

2.1 WS is still a nice method. No matter at which 90%-part of the series you look, you would still expect a forecast based on the same rules => full weight.

2.2 WW: The closer it happens to the end of the series, the more cautious I would be. Intuitively, I would come up with a weight factor sliding between 0 (warping at the end) and 1 (warping at the beginning), assuming that the most recent features of the curve are the most relevant.

Demonstration using the `DiceKriging` model in R.

The first image shows a fit of the trend line $y(t) = \beta_0 + \beta_1 t + \beta_2 t^2 +\beta_3 \sin(2 \pi t) + \beta_4 \sin(2 \pi t)$.

The 95% confidence interval is much smaller than compared with the arima image. But note that the residual term is also very small and there are a lot of datapoints. For comparison three other fits are made.

A simpler (linear) model with less datapoints is fit. Here you can see the effect of the error in the trend line causing the prediction confidence interval to increase when extrapolating further away (this confidence interval is also only as much correct as the model is correct).
An ordinary least squares model. You can see that it provides more or less the same confidence interval as the Gaussian process model
An ordinary Kriging model. This is a gaussian process without the trend included. The predicted values will be equal to the mean when you extrapolate far away. The error estimate is large because the residual terms (data-mean) are large.

library(DiceKriging)
library(datasets)


# data
y <- as.numeric(co2)
x <- c(1:length(y))/12

# design-matrix 
# the model is a linear sum of x, x^2, sin(2*pi*x), and cos(2*pi*x)
xm <- cbind(rep(1,length(x)),x, x^2, sin(2*pi*x), cos(2*pi*x))
colnames(xm) <- c("i","x","x2","sin","cos")

# fitting non-stationary Gaussian processes 
epsilon <- 10^-3
fit1 <- km(formula= ~x+x2+sin+cos, 
          design = as.data.frame(xm[,-1]), 
          response = as.data.frame(y),
          covtype="matern3_2", nugget=epsilon)

# fitting simpler model and with less data (5 years)
epsilon <- 10^-3
fit2 <- km(formula= ~x, 
           design = data.frame(x=x[120:180]), 
           response = data.frame(y=y[120:180]),
           covtype="matern3_2", nugget=epsilon)

# fitting OLS
fit3 <- lm(y~1+x+x2+sin+cos, data = as.data.frame(cbind(y,xm)))

# ordinary kriging 
epsilon <- 10^-3
fit4 <- km(formula= ~1, 
           design = data.frame(x=x), 
           response = data.frame(y=y),
           covtype="matern3_2", nugget=epsilon)


# predictions and errors
newx <- seq(0,80,1/12/4)
newxm <- cbind(rep(1,length(newx)),newx, newx^2, sin(2*pi*newx), cos(2*pi*newx))
colnames(newxm) <- c("i","x","x2","sin","cos")
# using the type="UK" 'universal kriging' in the predict function
# makes the prediction for the SE take into account the variance of model parameter estimates
newy1 <- predict(fit1, type="UK", newdata = as.data.frame(newxm[,-1]))
newy2 <- predict(fit2, type="UK", newdata = data.frame(x=newx))
newy3 <- predict(fit3, interval = "confidence", newdata = as.data.frame(x=newxm))
newy4 <- predict(fit4, type="UK", newdata = data.frame(x=newx))

# plotting
plot(1959-1/24+newx, newy1$mean,
 col = 1, type = "l",
 xlim = c(1959, 2039), ylim=c(300, 480),
 xlab = "time [years]", ylab = "atmospheric CO2 [ppm]")
polygon(c(rev(1959-1/24+newx), 1959-1/24+newx), c(rev(newy1$lower95), newy1$upper95), 
        col = rgb(0,0,0,0.3), border = NA)
points(1959-1/24+x, y, pch=21, cex=0.3, col=1, bg="white")
title("Gausian process with polynomial + trigonometric function for trend")

# plotting
plot(1959-1/24+newx, newy2$mean,
 col = 2, type = "l",
 xlim = c(1959, 2010), ylim=c(300, 380),
 xlab = "time [years]", ylab = "atmospheric CO2 [ppm]")
polygon(c(rev(1959-1/24+newx), 1959-1/24+newx), c(rev(newy2$lower95), newy2$upper95), 
        col = rgb(1,0,0,0.3), border = NA)
points(1959-1/24+x, y, pch=21, cex=0.5, col=1, bg="white")
points(1959-1/24+x[120:180], y[120:180], pch=21, cex=0.5, col=1, bg=2)
title("Gausian process with linear function for trend")

# plotting
plot(1959-1/24+newx, newy3[,1],
     col = 1, type = "l",
     xlim = c(1959, 2039), ylim=c(300, 480),
     xlab = "time [years]", ylab = "atmospheric CO2 [ppm]")
polygon(c(rev(1959-1/24+newx), 1959-1/24+newx), c(rev(newy3[,2]), newy3[,3]), 
        col = rgb(0,0,0,0.3), border = NA)
points(1959-1/24+x, y, pch=21, cex=0.3, col=1, bg="white")
title("Ordinory linear regression with polynomial + trigonometric function for trend")


# plotting
plot(1959-1/24+newx, newy4$mean,
 col = 1, type = "l",
 xlim = c(1959, 2039), ylim=c(300, 480),
 xlab = "time [years]", ylab = "atmospheric CO2 [ppm]")
polygon(c(rev(1959-1/24+newx), 1959-1/24+newx), c(rev(newy4$lower95), newy4$upper95), 
        col = rgb(0,0,0,0.3), border = NA, lwd=0.01)
points(1959-1/24+x, y, pch=21, cex=0.5, col=1, bg="white")
title("ordinary kriging")

Best Answer

Related Solutions

Time Series Forecasting – Why Gaussian Processes Are Valid Statistical Models

Demonstration using the DiceKriging model in R.

Related Question

Demonstration using the `DiceKriging` model in R.