ARIMA accuracy measures, rolling forecast

accuracy, aic, arima, forecasting, moving window

Regarding ARIMA model selection, and especially accuracy measures, several questions came to my mind. To briefly summarize my understanding: after the necessary transformations/differencing, the p and q values of an ARIMA model can be chosen by minimizing AIC/AICc/BIC, and after that the coefficients are estimated by least squares/maximum likelihood optimization (the 'fit' method in the various frameworks). Hence, we have the model fully specified. Furthermore, let's assume the residuals look like white noise and pass the Ljung-Box test, so that we have everything needed to say the selected model is the best one (is that true?).
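For concreteness, here is a minimal sketch of that workflow, assuming Python with statsmodels; the series `y`, the differencing order `d`, and the small (p, q) grid are placeholders, not part of the original question:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

# Placeholder series standing in for the (already transformed) data
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=200)))

d = 1  # differencing order assumed chosen beforehand
best_aic, best_order = np.inf, None
for p in range(4):
    for q in range(4):
        try:
            res = ARIMA(y, order=(p, d, q)).fit()  # maximum likelihood fit
            if res.aic < best_aic:
                best_aic, best_order = res.aic, (p, d, q)
        except Exception:
            continue  # skip orders that fail to converge

best_fit = ARIMA(y, order=best_order).fit()

# Ljung-Box test on the residuals: large p-values are consistent with white noise
lb = acorr_ljungbox(best_fit.resid, lags=[10], return_df=True)
print(best_order, best_aic)
print(lb)
```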

I see multiple options to check the model forecast accuracy:

Option 1: We split the data into train/test sets; after fitting the model on the training set, we produce a multi-step forecast over the horizon of the test set (let's say h = 10) and compare the true test-set values with the forecast using MAPE/RMSE/etc.
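A sketch of Option 1, again assuming statsmodels; the placeholder series, the h = 10 split, and the fixed order (1, 1, 1) are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(size=110)))  # placeholder series

h = 10
train, test = y[:-h], y[-h:]

res = ARIMA(train, order=(1, 1, 1)).fit()  # order assumed to be selected already
fcast = res.forecast(steps=h)              # one multi-step forecast of length h

errors = test.values - fcast.values
rmse = np.sqrt(np.mean(errors ** 2))
mape = np.mean(np.abs(errors / test.values)) * 100
print(f"multi-step RMSE={rmse:.3f}, MAPE={mape:.2f}%")
```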

Option 2: We split the data into train/test sets; after fitting the model on the training set, we produce rolling forecasts over the horizon of the test set (let's say h = 10) and compare the true test-set values with the forecasts using MAPE/RMSE/etc. By rolling forecast I mean that we first make a one-step forecast (for the first value after the training set), then add that as an observation, refit the model (without re-selecting p, d, q) on the 'new' training set (the original training set plus that first point), make the next one-step forecast, and so on (this is cross-validation, if I am right).
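And a sketch of Option 2. Note that in the usual time-series cross-validation setup, the value appended at each step is the actual test observation rather than the forecast; the loop below follows that convention, again with placeholder data and a fixed order:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = pd.Series(np.cumsum(rng.normal(size=110)))  # placeholder series

h = 10
train, test = y[:-h], y[-h:]

history = list(train)
one_step_fcasts = []
for actual in test:
    res = ARIMA(history, order=(1, 1, 1)).fit()   # refit with the same (p, d, q)
    one_step_fcasts.append(res.forecast(steps=1)[0])
    history.append(actual)                        # roll forward with the observed value

errors = test.values - np.array(one_step_fcasts)
rmse = np.sqrt(np.mean(errors ** 2))
mape = np.mean(np.abs(errors / test.values)) * 100
print(f"rolling one-step RMSE={rmse:.3f}, MAPE={mape:.2f}%")
```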

My questions:

  • Is it guaranteed that the model selected by minimizing AIC/AICc/BIC will provide the best forecasts (in terms of MAPE/RMSE/etc.)?

  • Why don't we select the best model by comparing MAPE/RMSE/etc. on test-set forecasts instead of by minimizing AIC/AICc/BIC?

  • In the case of Option 1, the errors will typically be larger, since over a longer horizon the multi-step forecast converges to the mean (for example). So why would we use Option 1 (accuracy estimation based on multi-step forecasts) instead of Option 2 (accuracy estimation based on rolling forecasts)? I think accuracy estimation based on multi-step forecasts is only required and useful if there is a demand for multi-step forecasts when the model is used later. If we only have to produce one-step forecasts in the future, accuracy estimation based on rolling forecasts is more valid and interpretable. Am I right?

Sorry if some (or all) questions are dumb :/

Thank you very much!

Best Answer

If I understood correctly, your main question is: why not use cross-validation performance metrics for model selection? The short answer is that by doing so you would erase the distinction between CV and in-sample model-fit metrics. Yet, in practice, many practitioners do exactly that.

The idea of CV is that you try to assess the model's performance when it is applied to data that was not used to train it. The problem is that in reality you have the full data set and then divide it into train and test, so you have actually seen all the data already. We play a game of peek-a-boo: we pretend that we didn't see the test set and use it to emulate out-of-sample performance. However, some practitioners automate the whole process to such a degree that they run in-sample training, then CV, get the metrics, then rinse and repeat over hundreds of model specifications to select the best one. When they do this, what is the real distinction between CV and in-sample? Effectively none. So you are basically left with in-sample fit.

Now, the nuance here is that with time-series models such as ARIMA, what we call in-sample metrics, such as AIC, aren't really in-sample: they are obtained from one-step-ahead predictions. But that is a minor technicality for the discussion we are having here.
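One quick way to see this, assuming statsmodels (the series and order below are placeholders): the in-sample fitted values of an ARIMA model are one-step-ahead predictions, and the AIC is computed from the same prediction-error likelihood.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
y = pd.Series(np.cumsum(rng.normal(size=200)))  # placeholder series

res = ARIMA(y, order=(1, 1, 1)).fit()

# In-sample fitted values are one-step-ahead predictions (dynamic=False),
# i.e. each point is predicted from the data before it.
one_step = res.predict(dynamic=False)
print(np.allclose(one_step, res.fittedvalues))

# AIC is built from the log-likelihood of those one-step-ahead prediction errors;
# the exact parameter-count convention may differ slightly across packages.
print(res.aic, -2 * res.llf + 2 * len(res.params))
```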