Tools/languages/techniques I am using
- python
- scikit-learn
- different regression models (only linear regression is shown here for simplicity)
I am working on a regression problem. The data is an hourly consumption time series, and I am trying to make a one-step-ahead prediction.
I first prepared the data and made sure that no data from the future leaks into the training data. For consumption at a certain hour (h0), a record looks as follows:
feature1 | feature2 | target |
---|---|---|
h-2 | h-1 | h0 |
Here h-1 and h-2 are the consumption values of the previous two hours.
Note: I am showing only two lagged hours here for simplicity. In reality, I am using different lag values and moving averages as features.
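For reference, the record layout described above can be sketched as follows. This is only an illustration under assumptions: it uses pandas (not listed among the tools above), a hypothetical helper named `make_lag_features`, and a tiny made-up series. The point is that `shift` only ever pulls values from earlier rows, so no future data reaches the features.

```python
import pandas as pd


def make_lag_features(series: pd.Series, lags=(1, 2)) -> pd.DataFrame:
    """Hypothetical helper: build lag columns plus the target from one series."""
    columns = {}
    for lag in sorted(lags, reverse=True):
        # shift(lag) moves values down, so row t holds the value from t - lag
        columns[f"lag_{lag}"] = series.shift(lag)
    columns["target"] = series
    # The first max(lags) rows have incomplete history, so they are dropped
    return pd.DataFrame(columns).dropna()


# Made-up hourly consumption values, purely illustrative
index = pd.date_range("2024-01-01", periods=6, freq=pd.Timedelta(hours=1))
consumption = pd.Series([10.0, 12.0, 15.0, 11.0, 9.0, 14.0], index=index)

features = make_lag_features(consumption)
print(features)
```

Each remaining row pairs the target hour with the two hours before it, matching the h-2 | h-1 | h0 layout in the table.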
I trained the model and then applied the predict function to test data.
After that, I plotted the actual vs. predicted values (y_test vs y_predict), and the prediction appears to be shifted by one hour into the future, as you can see below.
When I shifted the prediction back by one hour, the performance difference was huge:
- R2 increased from 0.64 to 0.89 (a 39% improvement)
- RMSE dropped from 1003 to 536.8 (a 46.5% improvement)
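To make the comparison concrete, here is a minimal sketch of how scores before and after shifting the prediction back could be computed. The helper `scores_after_shift` and the synthetic data are assumptions for illustration, not my real data or pipeline. Keep in mind that the shifted score is only a diagnostic: it pairs each prediction with an earlier actual value, so it does not measure real forecasting performance.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def scores_after_shift(y_true, y_pred, shift=1):
    """Hypothetical helper: (R2, RMSE) after moving predictions `shift` steps back."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if shift > 0:
        # Pair the prediction made for time t with the actual value at t - shift
        y_true, y_pred = y_true[:-shift], y_pred[shift:]
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return float(r2_score(y_true, y_pred)), rmse


# Synthetic, strongly persistent series (illustrative only)
rng = np.random.default_rng(0)
actual = 5000 + np.cumsum(rng.normal(0, 100, size=500))
# Stand-in "model" that simply echoes the previous hour
echo_pred = np.concatenate(([actual[0]], actual[:-1]))

r2_raw, rmse_raw = scores_after_shift(actual, echo_pred, shift=0)
r2_shifted, rmse_shifted = scores_after_shift(actual, echo_pred, shift=1)
print(f"raw:     R2={r2_raw:.3f}  RMSE={rmse_raw:.1f}")
print(f"shifted: R2={r2_shifted:.3f}  RMSE={rmse_shifted:.1f}")
```

In this toy case the shifted prediction matches the actual values exactly, because the stand-in model is nothing more than the previous hour.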
My Question
- What could I be doing wrong?
- Am I doing something wrong or could this shift be an indication of something else?
Best Answer
This shift is an indication of a very strong correlation with the previous lag `h-1` and a low correlation with the other feature variables. In other words, the model is mainly using `h-1` to estimate the current hour's consumption `h`.

While this can lead to acceptable results (and sometimes really good results as well) in terms of `R2` and `RMSE`, it also means that the model is not really better than a baseline model that just uses `h-1` to estimate `h` (i.e. `f(h) = h-1`). In this case, a machine learning model is just adding complexity with no clear improvement in performance. Nothing smart is going on here.
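One way to check this is to score the model against the naive baseline `f(h) = h-1` on the same test split. The sketch below is illustrative only: the random-walk data is synthetic, the two-lag feature set mirrors the simplified example in the question, and the variable names are assumptions rather than the asker's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic, strongly persistent hourly series (illustrative only)
rng = np.random.default_rng(42)
series = 5000 + np.cumsum(rng.normal(0, 100, size=2000))

# Feature columns are h-2 and h-1; the target is h
X = np.column_stack([series[:-2], series[1:-1]])
y = series[2:]

# Chronological split, no shuffling, so the test set lies strictly in the future
split = int(len(y) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = LinearRegression().fit(X_train, y_train)
model_pred = model.predict(X_test)
naive_pred = X_test[:, 1]  # baseline: predict h with h-1


def rmse(y_true, y_hat):
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))


rmse_model = rmse(y_test, model_pred)
rmse_naive = rmse(y_test, naive_pred)
print(f"model RMSE: {rmse_model:.1f}")
print(f"naive RMSE: {rmse_naive:.1f}")
print("coefficients (h-2, h-1):", model.coef_)
```

If the two RMSE values are close and the coefficient on `h-1` dominates, the model is effectively reproducing the baseline, which is the situation described above.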
This video from Marco Peixeiro, the author of the book Time Series Forecasting in Python, discusses this exact problem as well.