Time Series Forecasting Process With Regard to Training and Test Sets

arima · forecasting · model-selection · time-series · train

I'm a bit confused about the process order in doing proper time series analysis/forecasting. Is it:

  1. Stationarity/seasonality checks; apply any required transformations
  2. Candidate model selection with ACF, PACF, and auto.arima…, using all data
  3. Split data into training and test
  4. Choose the best model based on accuracy measures (RMSE, …) and information criteria (AICc, BIC, …)
  5. etc

or:

  1. Stationarity/seasonality checks; apply any required transformations
  2. Split data into training and test sets
  3. Candidate model selection with ACF, PACF, and auto.arima…, using just the training data
  4. Choose the best model based on accuracy measures (RMSE, …) and information criteria (AICc, BIC, …)
  5. etc

Or neither? I've looked around for examples and read through textbooks but can't find a straight answer.

First post, so sorry if this isn't the kind of thing that should be asked here, or if it wasn't asked correctly.

Thanks!

Best Answer

I would second the recommendation for Hyndman & Athanasopoulos, and note that there is now a third edition (the comment by Mehmet linked to the second edition).

Time series forecasting is sometimes more of an art than a science, but in general the second pipeline you described is better. The #1 pitfall with time series is look-ahead bias: using information to make a prediction that wouldn't actually have been available to you at the time.

A related form of look-ahead bias is selecting features or models using the test data, and thereby potentially overfitting to relationships that happen to hold in the test period.
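To make the second pipeline concrete, here is a minimal stdlib-Python sketch (not R/auto.arima): a toy AR(1) series is split chronologically, the coefficient is estimated on the training data only, and one-step-ahead forecasts are scored on the held-out test period. The simulated series, the 0.8 coefficient, and the 80/20 split are all illustrative assumptions, not part of the original answer.

```python
import random, math

random.seed(0)
# Simulate a toy AR(1) series: y_t = 0.8 * y_{t-1} + noise (assumption for illustration)
y = [0.0]
for _ in range(199):
    y.append(0.8 * y[-1] + random.gauss(0, 1))

# 1. Split chronologically BEFORE model selection -- never shuffle time series
split = int(len(y) * 0.8)
train, test = y[:split], y[split:]

# 2. Estimate the AR(1) coefficient by least squares on the TRAINING data only
num = sum(train[t] * train[t - 1] for t in range(1, len(train)))
den = sum(train[t - 1] ** 2 for t in range(1, len(train)))
phi = num / den

# 3. One-step-ahead forecasts over the test period; each forecast uses only
#    values that would have been observed by that point in time
preds = [phi * prev for prev in ([train[-1]] + test[:-1])]
rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, test)) / len(test))
print(f"phi_hat = {phi:.3f}, test RMSE = {rmse:.3f}")
```

The key point is in step 2: no statistic computed from `test` ever feeds back into the fitting, so the test RMSE is an honest estimate of out-of-sample performance.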

You want to build features based on relationships learned in the training period and then see whether those relationships continue to hold in the test period. Even better, use separate train/validation/test periods for feature selection, model selection, and final model testing, respectively.
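The three-way split above can be sketched as a simple helper; the 60/20/20 fractions are an assumption for illustration, and the only hard rule is that the splits stay in chronological order:

```python
# Chronological three-way split: earliest data for training, middle for
# validation (feature/model selection), latest for final testing.
def train_val_test_split(series, train_frac=0.6, val_frac=0.2):
    n = len(series)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return series[:i], series[i:j], series[j:]

data = list(range(100))  # stand-in for a time-ordered series
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # → 60 20 20
```

Only after a model is chosen on the validation period do you touch the test period, and only once, for the final accuracy estimate.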