I just started to learn time series (about time after avoiding it for very long).
I read through some short summaries and jumped straight to checking whether one can model time series using supervised learning. It turns out there are two ways.
The approach I want to ask about is based on the well-known Kaggle notebook on XGBoost for time series. The notebook is clear: the author decomposes each datetime record into its hour, day of week, month, etc. to roughly capture trend and seasonality.
A short pseudocode sketch is below:
def create_features(df, label=None):
    """
    Creates time series features from datetime index
    """
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    ...
    X = df[['hour','dayofweek','quarter','month','year',
            'dayofyear','dayofmonth','weekofyear']]
    return ...
I tried it myself on a dataset; it works fine and the predictions on the unseen test set are decent. I thought all was good and tried to add more features, especially rolling mean and rolling median over past time periods, and soon ran into a big issue. During training and validation we have the target values, so these rolling or lag features can be computed. But on an unseen test set there should be no target available to build them from. How, then, should we go about forecasting, say, 3 days ahead? Note that I did not notice this earlier because the initial features are agnostic of the target: as long as we have a datetime, we can create them.
I hope someone can point me to some good tutorials on this. I read the de facto tutorials here, but the author uses the other way of framing time series as a supervised problem.
Best Answer
The rolling mean and lagged target values always belong to the past. If you plan to predict day $t+k$ from features available up to day $t$, the rolling mean, however long its window, should contain only target values up to and including day $t$.
Equivalently, for any time $t$ in the test set, you can assume the data are available up to and including day $t-k$, even if day $t-k$ is itself in the test set. You should preprocess your test set accordingly.
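To make this concrete, here is a minimal pandas sketch of horizon-aware lag and rolling features. It is only an illustration under stated assumptions: the helper name `add_lagged_features`, the column name `y`, the horizon `k = 3`, the window of 7 days, and the toy data are all hypothetical and do not come from the notebook. The key step is shifting the target by the horizon before any rolling statistic is taken, so the row for day $t$ only sees targets up to day $t-k$.

```python
import numpy as np
import pandas as pd


def add_lagged_features(df, target="y", horizon=3, window=7):
    """Add target-based features that only use values at least `horizon` steps old.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    out = df.copy()
    # Shift first: the row for day t now holds the target observed on day t - horizon.
    past = out[target].shift(horizon)
    out[f"lag_{horizon}"] = past
    # Rolling statistics on the shifted series cover days
    # t - horizon - window + 1 through t - horizon, never anything more recent.
    out[f"rollmean_{window}"] = past.rolling(window).mean()
    out[f"rollmedian_{window}"] = past.rolling(window).median()
    return out


# Toy daily series (assumed data): 30 days, target equal to the day number 0..29.
idx = pd.date_range("2021-01-01", periods=30, freq="D")
full = pd.DataFrame({"y": np.arange(30, dtype=float)}, index=idx)

# Build features on the whole series, then split by time.
# Test rows may use earlier test-period targets, but only ones >= horizon days old.
features = add_lagged_features(full, target="y", horizon=3, window=7)
train, test = features.iloc[:24], features.iloc[24:]
```

The first `horizon + window - 1` rows have missing rolling features and would typically be dropped before training. Because every target-based feature is at least `horizon` days old, the same preprocessing can be applied at prediction time using whatever history is available on the day the forecast is made.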