I'm developing an ML-based model to forecast daily sales for an entire month. The model takes as input a set of precomputed time series features: day_of_week, day_of_month, day_of_year, week_of_year, month, and many more. Additionally, the time series has a strong monthly seasonal pattern, and the pattern can differ greatly from one month to another.
The problem is that I've been experiencing high variability in the model's hyperparameters depending on the chosen validation set.
Let's say I want to forecast July-2019. I tried using each of the different months from July-2018 to June-2019 as the validation set, and found a very different hyperparameter configuration for each one. I think this is due to the changing sales pattern between months.
For these reasons my intuition tells me to use June-2018 as the validation set, as it is more "representative" of what my test set would look like. However, it also seems that I'm losing 11 months of data for validating the model.
Which approach for selecting the validation set would you recommend for this problem?
Best Answer
I also came across this problem when working on a forecasting project.
First, say you are doing a grid search over your hyperparameters and you have a set of configurations you want to test.
Because this is a time series dataset, we always want to predict into the future. Depending on how many "folds" you wish to use, you can compute the CV error with an expanding window: for each fold, train on all the months up to a cutoff, validate on the month that follows, and average the errors across the folds.
Do this for each hyperparameter configuration in your search space, and choose the one that gives the lowest CV error.
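To make this concrete, here is a minimal sketch of the expanding-window search described above. Everything in it is an assumption for illustration: the three features, the synthetic sales series, the choice of `GradientBoostingRegressor`, the four validation months, and the tiny grid all stand in for whatever you actually use.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic daily data with a monthly seasonal pattern, standing in for real sales.
rng = np.random.default_rng(0)
dates = pd.date_range("2017-07-01", "2019-06-30", freq="D")
df = pd.DataFrame({
    "day_of_week": dates.dayofweek,
    "day_of_month": dates.day,
    "month": dates.month,
}, index=dates)
df["sales"] = 100 + 5 * df["month"] + rng.normal(0, 3, len(df))

features = ["day_of_week", "day_of_month", "month"]
val_months = ["2019-03", "2019-04", "2019-05", "2019-06"]  # the "folds"
grid = [{"max_depth": d, "n_estimators": n} for d in (2, 3) for n in (50, 100)]

def cv_error(params):
    """Average validation MAE over expanding-window monthly folds."""
    errs = []
    for m in val_months:
        val = df.loc[m]                        # validate on one month
        train = df[df.index < val.index[0]]    # train on everything before it
        model = GradientBoostingRegressor(random_state=0, **params)
        model.fit(train[features], train["sales"])
        errs.append(mean_absolute_error(val["sales"],
                                        model.predict(val[features])))
    return float(np.mean(errs))

# Pick the configuration with the lowest average CV error.
best_params = min(grid, key=cv_error)
print(best_params)
```

Note that each fold only ever trains on data strictly before its validation month, so the model always predicts into the future, mirroring the real forecasting setup.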
After you have chosen the hyperparameters, fit the model on all the data up to the month you want to forecast, and use the fitted model to forecast that month.
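The final step above might look like the following sketch. Again, the features, the model, the synthetic history, and the `best_params` placeholder are all hypothetical, assuming July-2019 is the target month:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def make_features(dates):
    # Hypothetical feature set; real models would include many more features.
    return pd.DataFrame({
        "day_of_week": dates.dayofweek,
        "day_of_month": dates.day,
        "month": dates.month,
    }, index=dates)

# All available history up to the end of June-2019 (synthetic stand-in).
hist = pd.date_range("2017-07-01", "2019-06-30", freq="D")
train = make_features(hist)
train["sales"] = 100 + 5 * train["month"] + rng.normal(0, 3, len(train))

# Placeholder for the winning configuration from the CV search.
best_params = {"max_depth": 3, "n_estimators": 100}
model = GradientBoostingRegressor(random_state=0, **best_params)
model.fit(train[["day_of_week", "day_of_month", "month"]], train["sales"])

# Forecast one value per day of the target month.
july = make_features(pd.date_range("2019-07-01", "2019-07-31", freq="D"))
forecast = model.predict(july)
print(len(forecast))  # one forecast per day of July-2019
```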
HTH.