Solved – Cross-validation for timeseries data with regression

Tags: cross-validation, forecasting, lags, machine learning, time series

I am familiar with "regular" cross-validation, but now I want to make time-series predictions using cross-validation with a simple linear regression function.
I will write down a simple example to help clarify my two questions: one about the train/test split, and one about how to train/test a model whose aim is to predict $n$ steps in advance.

(1) The data

Assume I have data for timepoints 1,…,10 as follows:

timeseries = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]

(2) Transforming the data into a format useful for supervised learning

As far as I understand, we can use "lags", i.e. shifts in the data, to create a dataset suited to supervised learning:

input = [NaN,0.5,0.3,10,4,5,6,1,0.4,0.1]
output/response = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]

Here I have simply shifted the time series by one step to create the input vector.
As far as I understand, I could now use input as the predictor for a linear regression model and output as the response (the NaN could be approximated or replaced with a random value).
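As a sketch, the lag construction above can be written as follows (the helper name `make_lagged` is my own, not from any library):

```python
import math

timeseries = [0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9]

def make_lagged(series, lag=1):
    """Shift the series by `lag` steps to build an input/response pair.
    The first `lag` inputs have no history, so they are set to NaN."""
    inputs = [float("nan")] * lag + series[:-lag]
    response = series[:]  # the response is the original series
    return inputs, response

inputs, response = make_lagged(timeseries, lag=1)
# inputs   -> [nan, 0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1]
# response -> [0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9]
```

In practice the leading NaN row is often simply dropped before fitting, rather than imputed.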

(3) Question 1: Cross-validation ("backtesting")

Say I now want to do 2 splits: do I have to shift the train set as well as the test set?

I.e. something like:

Train-set:

Independent variable: [NaN,0.5,0.3,10,4,5]

Output/response variable: [0.5,0.3,10,4,5,6]

Test-set:

Independent variable: [1,0.4,0.1]

Output/response variable: [0.4,0.1,0.9]
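A minimal sketch of such an expanding-window ("backtesting") split; the fold-size arithmetic here is my own choice, roughly mimicking what scikit-learn's `TimeSeriesSplit` does, but it is not that exact algorithm:

```python
def backtest_splits(n_samples, n_splits):
    """Expanding-window splits for time series: each fold trains on all
    observations up to a cutoff and tests on the block that follows, so
    the test set always lies strictly in the future of the train set."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = list(range(0, k * fold))
        test_idx = list(range(k * fold, min((k + 1) * fold, n_samples)))
        yield train_idx, test_idx

# With 10 time points and 2 splits:
for train_idx, test_idx in backtest_splits(10, 2):
    print(train_idx, "->", test_idx)
# [0, 1, 2] -> [3, 4, 5]
# [0, 1, 2, 3, 4, 5] -> [6, 7, 8]
```

The same lag construction is then applied within each fold, i.e. both the train and the test pairs are built with the same shift.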

(4) Question 2: Predicting different lags in advance

Obviously, I have shifted the dependent relative to the independent variable by 1. Assuming now that I would like to train a model which can predict 5 time steps in advance: can I keep this lag of one and nevertheless use the model to predict $n+1,\dots,n+5$, or do I change the shift from independent to dependent variable to 5? What exactly is the difference?

Best Answer

For the first question, as Richard Hardy points out, there is an excellent blog post on the topic. There is also this post and this post which I have found very helpful.

For the second question, you need to take into account the two basic approaches to multi-step time series forecasting: recursive forecasting and direct forecasting:

  • In recursive forecasting (also called iterated forecasting) you train your model for one-step-ahead forecasts only. After training is done, you apply your final model recursively to forecast 1 step ahead, 2 steps ahead, etc., until you reach the desired $n$-step forecast horizon. To do this, you feed the forecast from each successive step back into the model to generate the next step. This approach is used by traditional forecasting algorithms like ARIMA and Exponential Smoothing, and can also be used for Machine Learning based forecasting (see this post for an example, and this post for some discussion).
  • Direct forecasting is when you train a separate model for each step (so you are trying to "directly" forecast the $n^{th}$ step ahead instead of reaching $n$ steps recursively). See Ben Taieb et al. for a discussion of direct forecasting and more complex combined approaches.

Note that Hyndman's blog post on cross validation for time series covers both one step ahead and direct forecasting.
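To make the direct approach concrete, here is a minimal sketch, assuming a single lagged value as the only feature and my own helper names (`fit_line`, `fit_direct_models`): one separate lag-$h$ regression is fit per horizon $h$.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b  # intercept a, slope b

def fit_direct_models(series, max_h):
    """Direct forecasting: one model per horizon h, each trained on
    pairs (Y_t, Y_{t+h}), i.e. with a shift of h instead of 1."""
    models = {}
    for h in range(1, max_h + 1):
        models[h] = fit_line(series[:-h], series[h:])
    return models

def direct_forecast(models, last_value, h):
    """Forecast h steps ahead directly from the last observed value."""
    a, b = models[h]
    return a + b * last_value
```

The difference to the recursive approach is that the 5-step model never consumes its own 1-step forecasts; each horizon has its own shift.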


To clarify recursive forecasting (based on the comments):

  1. First you train your model.
  2. Once training is done, you take $[Y_1, Y_2, \dots, Y_t]$ to calculate $\hat{Y}_{t+1}$ (this is your 1-step-ahead forecast),
  3. then you use $[Y_2, \dots, Y_t, \hat{Y}_{t+1}]$ to calculate $\hat{Y}_{t+2}$, then $[Y_3, \dots, Y_t, \hat{Y}_{t+1}, \hat{Y}_{t+2}]$ to calculate $\hat{Y}_{t+3}$, and so on, until you reach $\hat{Y}_{t+n}$.

(Here $Y$ are actual values and $\hat{Y}$ are forecast values.)
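The steps above can be sketched in code. This is an assumption-laden toy: the one-step model uses only the last observed value (matching the lag of 1 from the question), and the coefficients `a`, `b` are taken as given from a previously trained model.

```python
def recursive_forecast(a, b, history, n_steps):
    """Apply a one-step model y_hat = a + b * y_prev recursively:
    each forecast is appended to the window and fed back in as the
    input for the next step, as in steps 2-3 above."""
    window = list(history)
    forecasts = []
    for _ in range(n_steps):
        y_hat = a + b * window[-1]
        forecasts.append(y_hat)
        window.append(y_hat)  # feed the forecast back into the model
    return forecasts

# Toy model y_hat = 2 * y_prev, starting from Y_t = 1:
print(recursive_forecast(0.0, 2.0, [1.0], 3))
# [2.0, 4.0, 8.0]
```

Note how forecast errors can compound: each $\hat{Y}$ becomes an input, which is the main trade-off against the direct approach.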