ARIMA accuracy measures, rolling forecast

accuracy, aic, arima, forecasting, moving window

Regarding ARIMA model selection, and especially accuracy measures, several questions came to my mind. To briefly summarize my understanding: after the necessary transformations/differencing, the p and q values of an ARIMA model can be chosen by minimizing AIC/AICc/BIC, and after that the coefficients are estimated by least squares/maximum likelihood optimization (the 'fit' method in the various frameworks). Hence, we have the model fully specified. Furthermore, let's assume the residuals look like white noise and pass the Ljung-Box test, so that we have everything needed to say the selected model is the best one (is that true?).
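For concreteness, here is a minimal sketch of that workflow, assuming Python with statsmodels; the series `y`, the differencing order `d`, and the small (p, q) grid are placeholders, not part of the original question:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# Placeholder series standing in for the (already transformed) data
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=200)))

d = 1  # differencing order assumed chosen beforehand
best_aic, best_order = np.inf, None
for p in range(4):
    for q in range(4):
        try:
            res = ARIMA(y, order=(p, d, q)).fit()  # maximum likelihood fit
            if res.aic < best_aic:
                best_aic, best_order = res.aic, (p, d, q)
        except Exception:
            continue  # skip orders that fail to converge

best_fit = ARIMA(y, order=best_order).fit()

# Ljung-Box test on the residuals: large p-values are consistent with white noise
lb = acorr_ljungbox(best_fit.resid, lags=[10], return_df=True)
print(best_order, best_aic)
print(lb)
```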

I see multiple options to check the model forecast accuracy:

Option 1: We split the data into train/test sets; after fitting the model on the training set, we produce a multi-step forecast over the horizon of the test set (let's say h = 10) and compare the true test-set values with the forecast using MAPE/RMSE/etc.
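A sketch of Option 1, again assuming statsmodels; the placeholder series, the h = 10 split, and the fixed order (1, 1, 1) are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(size=110)))  # placeholder series

h = 10
train, test = y[:-h], y[-h:]

res = ARIMA(train, order=(1, 1, 1)).fit()  # order assumed to be selected already
fcast = res.forecast(steps=h)              # one multi-step forecast of length h

errors = test.values - fcast.values
rmse = np.sqrt(np.mean(errors ** 2))
mape = np.mean(np.abs(errors / test.values)) * 100
print(f"multi-step RMSE={rmse:.3f}, MAPE={mape:.2f}%")
```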

Option 2: We split the data into train/test sets; after fitting the model on the training set, we produce rolling forecasts over the horizon of the test set (let's say h = 10) and compare the true test-set values with the forecasts using MAPE/RMSE/etc. By rolling forecast I mean that we first make a one-step forecast (for the first value after the training set), then add that as an observation, refit the model (without re-selecting p, d, q) on the 'new' training set (the original training set plus that first point), make the next one-step forecast, and so on (this is cross-validation, if I am right).
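And a sketch of Option 2. Note that in the usual time-series cross-validation setup, the value appended at each step is the actual test observation rather than the forecast; the loop below follows that convention, again with placeholder data and a fixed order:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = pd.Series(np.cumsum(rng.normal(size=110)))  # placeholder series

h = 10
train, test = y[:-h], y[-h:]

history = list(train)
one_step_fcasts = []
for actual in test:
    res = ARIMA(history, order=(1, 1, 1)).fit()   # refit with the same (p, d, q)
    one_step_fcasts.append(res.forecast(steps=1)[0])
    history.append(actual)                        # roll forward with the observed value

errors = test.values - np.array(one_step_fcasts)
rmse = np.sqrt(np.mean(errors ** 2))
mape = np.mean(np.abs(errors / test.values)) * 100
print(f"rolling one-step RMSE={rmse:.3f}, MAPE={mape:.2f}%")
```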

My questions:

  • Is it guaranteed that the model selected by minimizing AIC/AICc/BIC will provide the best forecasts (in terms of MAPE/RMSE/etc.)?

  • Why don't we select the best model by comparing MAPE/RMSE/etc. on test-set forecasts instead of by minimizing AIC/AICc/BIC?

  • In the case of Option 1, the errors will typically be larger, since over a longer horizon the multi-step forecast converges to the mean (for example). So why would we use Option 1 (accuracy estimation based on multi-step forecasts) instead of Option 2 (accuracy estimation based on rolling forecasts)? I think accuracy estimation based on multi-step forecasts is only required and useful if there is a demand for multi-step forecasts when the model is used later. If we only have to produce one-step forecasts in the future, accuracy estimation based on rolling forecasts is more valid and interpretable. Am I right?

Sorry if some (or all) questions are dumb :/

Thank you very much!

Best Answer

If I understood correctly, your main question is: why not use cross-validation performance metrics for model selection? The short answer is that by doing so you would erase the distinction between CV and in-sample model-fit metrics. Yet, in practice, many practitioners do exactly that.

The idea of CV is that you try to assess the model's performance when it is applied to data that was not used to train it. The problem is that in reality you have the full data set and then divide it into train and test, so you have actually seen all the data already. We play a game of peek-a-boo: we pretend that we didn't see the test set and use it to emulate out-of-sample performance. However, some practitioners automate the whole process to such a degree that they run in-sample training, then CV, get the metrics, then rinse and repeat over hundreds of model specifications to select the best one. When they do this, what is the real distinction between CV and in-sample? Effectively none. So you are basically left with in-sample fit.

Now, the nuance here is that with time-series models such as ARIMA, what we call in-sample metrics, such as AIC, aren't really in-sample: they are obtained from one-step-ahead predictions. But that is a minor technicality for the discussion we are having here.
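One quick way to see this, assuming statsmodels (the series and order below are placeholders): the in-sample fitted values of an ARIMA model are one-step-ahead predictions, and the AIC is computed from the same prediction-error likelihood.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
y = pd.Series(np.cumsum(rng.normal(size=200)))  # placeholder series

res = ARIMA(y, order=(1, 1, 1)).fit()

# In-sample fitted values are one-step-ahead predictions (dynamic=False),
# i.e. each point is predicted from the data before it.
one_step = res.predict(dynamic=False)
print(np.allclose(one_step, res.fittedvalues))

# AIC is built from the log-likelihood of those one-step-ahead prediction errors;
# the exact parameter-count convention may differ slightly across packages.
print(res.aic, -2 * res.llf + 2 * len(res.params))
```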