Sample size for best forecasting ARIMA model

arima, forecasting, r, sample-size, time series

How can we decide how much of the available data to use so that the fitted ARIMA model has the best forecasting properties?

For example, suppose we have an hourly series with over 28,000 observations.

What criterion tells us whether to fit the ARIMA to the last 100 observations, or the last 250, so that the resulting model forecasts better?
I am interested in short-term prediction, for example 24 hours ahead.

I have searched widely but have not found a criterion yet.

Best Answer

A good rule of thumb is: more is usually better.

Then again, more may not always be better. For instance, your data-generating process may have changed strongly over time, so that the data from before the change no longer reflect the current and future dynamics of your series. In such a case, it may indeed be better not to use your full dataset. (If you suspect something like this is going on, you may want to look at our questions on the topic.)

Overall, the best way to assess almost anything about a forecast is to use a holdout sample. Hold out the last part of your data, say the last week (168 hours). Fit a model to the $n$ historical periods before that. Forecast out 24 hours. Note the error. Move the entire setup forward one hour (fit with the $n$ most recent historical periods, forecast 24 hours ahead, note the error). Do this until you have gone through your holdout sample. You should now have $168-24+1=145$ forecasts, each with its error over the 24-hour horizon. Repeat this for various "reasonable" values of $n$, and pick the $n$ that yields the lowest overall error.
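The rolling-origin procedure just described can be sketched in a few lines. This is only a minimal illustration under stated assumptions: it uses plain Python with no dependencies, a synthetic series, and a simple AR(1) model fitted by least squares as a stand-in for ARIMA. The names `fit_ar1`, `forecast_ar1` and `rolling_origin_mse` are hypothetical helpers, not part of any library. In practice you would replace `forecast_ar1` with a function that fits your ARIMA model to the window and returns its 24-step forecast.

```python
import math
import random


def fit_ar1(window):
    """Least-squares fit of y[t] = c + phi * y[t-1] on the window (ARIMA stand-in)."""
    x, y = window[:-1], window[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx if sxx > 0 else 0.0
    return mean_y - phi * mean_x, phi


def forecast_ar1(window, horizon):
    """Iterate the fitted AR(1) recursion `horizon` steps ahead."""
    c, phi = fit_ar1(window)
    preds, last = [], window[-1]
    for _ in range(horizon):
        last = c + phi * last
        preds.append(last)
    return preds


def rolling_origin_mse(series, n, holdout=168, horizon=24, forecaster=forecast_ar1):
    """Return (MSE pooled over all origins and horizons, number of origins)."""
    start = len(series) - holdout  # first forecast origin
    if start < n:
        raise ValueError("not enough history before the holdout for this n")
    sq_errors, n_origins = [], 0
    # Move the origin forward one step at a time through the holdout sample.
    for origin in range(start, len(series) - horizon + 1):
        window = series[origin - n:origin]        # the n periods before the origin
        preds = forecaster(window, horizon)
        actual = series[origin:origin + horizon]
        sq_errors.extend((p - a) ** 2 for p, a in zip(preds, actual))
        n_origins += 1
    return sum(sq_errors) / len(sq_errors), n_origins


# Synthetic AR(1) data, purely for illustration (not the asker's series).
random.seed(0)
series = [0.0]
for _ in range(1999):
    series.append(0.8 * series[-1] + random.gauss(0.0, 1.0))

candidates = [100, 250, 500, 1000]  # "reasonable" values of n, chosen arbitrarily here
results = {n: rolling_origin_mse(series, n)[0] for n in candidates}
best_n = min(results, key=results.get)
print(results, best_n)
```

With the default holdout of 168 and horizon of 24, the loop visits 145 forecast origins, matching the count above. Refitting a real ARIMA model at every origin is much slower than this toy model, so you may want to keep the list of candidate $n$ short.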

You will need to specify "reasonable" values of $n$, best based on your prior knowledge. Alternatively, just pick some numbers that make sense. You will also need to specify the forecast accuracy measure you want to use. I'd recommend the Mean Squared Error (MSE) if you are looking for unbiased point forecasts.
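For concreteness, one way to write the MSE for a given history length $n$ in the setup above is the following. The notation is an assumption for illustration: $T_i$ denotes the $i$-th of the 145 forecast origins, $y_{T_i+h}$ the actual value $h$ hours later, and $\hat{y}^{(n)}_{T_i+h \mid T_i}$ the forecast from the model fitted to the $n$ periods up to $T_i$. Pooling the squared errors over all 24 horizons is also a choice; you could instead look at each horizon separately.

$$\mathrm{MSE}(n) = \frac{1}{145 \cdot 24} \sum_{i=1}^{145} \sum_{h=1}^{24} \left( y_{T_i+h} - \hat{y}^{(n)}_{T_i+h \mid T_i} \right)^2$$

You would then pick the candidate $n$ with the smallest $\mathrm{MSE}(n)$.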

This section in a free online forecasting textbook is very helpful.


That said, 28,000 hourly data points correspond to over three years of data. Without knowing anything else about your data, I suspect that there may be multiple sources of seasonality involved, like intra-daily, intra-weekly (e.g., for retail, call center or electricity demands, which show strong weekly patterns), or intra-yearly (like temperatures or weather information more generally). (S)ARIMA is not really made for handling multiple seasonalities. If you suspect that your data exhibit complex seasonalities, it will probably be more useful to use a model designed for them than to optimize the history used by an inappropriate model like ARIMA. This earlier question may be helpful, as may other questions in the time-series tag on "complex seasonalities".