I am familiar with "regular" cross-validation, but now I want to make time series predictions while using cross-validation with a simple linear regression function.
Below is a simple example to help clarify my two questions: one about the train/test split, and one about how to train and test models when the aim is to predict n steps in advance for different values of n.
(1) The data
Assume I have data for timepoints 1,…,10 as follows:
timeseries = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]
(2) Transforming the data into a format useful for supervised learning
As far as I understand, we can use "lags", i.e. shifts in the data to create a dataset suited for supervised learning:
input = [NaN,0.5,0.3,10,4,5,6,1,0.4,0.1]
output/response = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]
Here I have simply shifted the timeseries by one for creating the output vector.
As far as I understand, I could now use input as the input for a linear regression model, and output for the response (the NaN could be approximated or replaced with a random value).
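To make the lag construction concrete, here is a minimal Python sketch. The helper name `make_lagged_pairs` is hypothetical, and it assumes the row with the NaN input is simply dropped rather than imputed, which is one option among several.

```python
def make_lagged_pairs(series, lag=1):
    """Return (inputs, outputs) where outputs[i] is observed `lag` steps after inputs[i].

    Assumption: the first `lag` rows, whose input would be NaN, are dropped.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    inputs = series[:-lag]   # values at time t
    outputs = series[lag:]   # values at time t + lag
    return inputs, outputs


timeseries = [0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9]
x, y = make_lagged_pairs(timeseries, lag=1)

for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

Dropping the first row leaves nine input/response pairs, all of them fully observed, so no value has to be invented for the missing input.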
(3) Question 1: Cross-validation ("backtesting")
Say I now want to do 2 splits. Do I have to shift the train set as well as the test set?
I.e. something like:
Train-set:
Independent variable: [NaN,0.5,0.3,10,4,5]
Output/response variable:[0.5,0.3,10,4,5,6]
Test-set:
Independent variable: [1,0.4,0.1]
Output/response variable:[0.4,0.1,0.9]
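A split like the one above can be sketched in Python as an expanding-window scheme, where each test block comes strictly after its training block. The function name `expanding_window_splits` and the sizing rule are assumptions chosen for illustration; the pairs are the lagged data with the NaN row dropped.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices); training data always precedes test data.

    Assumption: each test block has n_samples // (n_splits + 1) rows and the
    training window grows from one split to the next.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size < 1:
        raise ValueError("too many splits for this many samples")
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Lagged pairs (input at time t, response at time t + 1), NaN row dropped.
x = [0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1]
y = [0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9]

for train_idx, test_idx in expanding_window_splits(len(x), n_splits=2):
    print("train inputs:", [x[i] for i in train_idx])
    print("test inputs: ", [x[i] for i in test_idx])
    print("test outputs:", [y[i] for i in test_idx])
```

Because the shift is applied once, before splitting, both the train and the test rows already carry their own lagged input, and the last test block reproduces the test set written out above.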
(4) Question 2: Predicting different lags in advance
As shown above, I have shifted the independent variable relative to the dependent variable by 1. Now assume I would like to train a model that can predict 5 time steps in advance. Can I keep this lag of one and nevertheless use the model to predict n+1,…,n+5,…, or do I change the shift between independent and dependent variable to 5? What exactly is the difference?
Best Answer
For the first question, as Richard Hardy points out, there is an excellent blog post on the topic. There is also this post and this post which I have found very helpful.
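For the second question, you need to take into account the two basic approaches to multistep time series forecasting: recursive forecasting and direct forecasting. In recursive forecasting, a single one-step-ahead model is applied repeatedly, with each forecast fed back in as the input for the next step. In direct forecasting, a separate model is fit for each horizon, so the shift between input and response equals the number of steps ahead.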
Note that Hyndman's blog post on cross validation for time series covers both one step ahead and direct forecasting.
To clarify recursive forecasting (based on the comments):
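One common way to write the recursive scheme, assuming a fitted one-lag model $\hat{f}$ and a last observed time point $n$ (both symbols are introduced here for illustration), is:

$$
\hat{Y}_{n+1} = \hat{f}(Y_n), \qquad
\hat{Y}_{n+2} = \hat{f}(\hat{Y}_{n+1}), \qquad \dots, \qquad
\hat{Y}_{n+5} = \hat{f}(\hat{Y}_{n+4})
$$

Only the first step uses an actual value as input; every later step uses the previous forecast.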
(Here $Y$ are actual values and $\hat{Y}$ are forecast values.)
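The difference between the two approaches can be sketched in Python as follows. This is a minimal illustration, assuming a one-lag simple linear regression fitted by ordinary least squares; the helper names `fit_line`, `recursive_forecast`, and `direct_forecast` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def recursive_forecast(series, horizon):
    """Fit one lag-1 model and feed each forecast back in as the next input."""
    intercept, slope = fit_line(series[:-1], series[1:])
    forecasts = []
    last = series[-1]
    for _ in range(horizon):
        last = intercept + slope * last
        forecasts.append(last)
    return forecasts


def direct_forecast(series, horizon):
    """Fit a separate lag-h model for each h and always predict from the last actual value."""
    forecasts = []
    for h in range(1, horizon + 1):
        intercept, slope = fit_line(series[:-h], series[h:])
        forecasts.append(intercept + slope * series[-1])
    return forecasts


timeseries = [0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9]
print("recursive:", recursive_forecast(timeseries, horizon=5))
print("direct:   ", direct_forecast(timeseries, horizon=5))
```

The two approaches agree at one step ahead, since both use the same lag-1 model there. Beyond that, the recursive version compounds its own forecast errors, while the direct version trains each model on fewer pairs as the horizon grows.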