Choosing regressors for inclusion in regression with ARMA errors

arima, feature selection, model selection, regression, time series

I would like to forecast a time series using an ARIMA model with multiple exogenous variables. My time series is the monthly unemployment rate (in percent) over several years, and my regressors are continuous page-view counts for several Wikipedia articles. The time series and the regressors all have the same length.

How do I choose the right regressors to include in the model? Using the auto.arima and forecast functions from the "forecast" package in R, my first attempt was to rank the regressors by the MAE each one achieves when used individually. So I start with only one regressor (the one with the best MAE), then add the second-best regressor, and so on. However, this post suggests choosing regressors according to significance, whereas this post by Rob Hyndman suggests using AIC.

How should I proceed? How do I accept/reject regressors?

Best Answer

The gold standard in time series model selection is to use a holdout sample. Hold out the last few months of data, fit the different models (with different combinations of regressors) to the data before that, forecast into your holdout sample, and pick the model with the lowest forecast error, measured for example by MAE or MSE.
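The holdout comparison could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the regressor names (`views_a`, `views_b`, `views_c`) are hypothetical stand-ins for the Wikipedia page-view series, and the 12-month holdout length is an arbitrary choice.

```r
library(forecast)

set.seed(42)
n <- 96   # 8 years of monthly data (simulated, not real unemployment data)
h <- 12   # number of months held out at the end (an arbitrary choice)

# Hypothetical stand-ins for the Wikipedia page-view regressors
X <- cbind(views_a = rnorm(n), views_b = rnorm(n), views_c = rnorm(n))

# Simulated response: depends on views_a only, plus AR(1) errors
ar_errors <- as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 0.3))
y <- 6 + 0.8 * X[, "views_a"] + ar_errors

train_idx <- 1:(n - h)
test_idx  <- (n - h + 1):n
y_train   <- ts(y[train_idx], frequency = 12)
y_test    <- y[test_idx]

# Every non-empty combination of the candidate regressors
candidates <- unlist(
  lapply(seq_len(ncol(X)), function(k) combn(colnames(X), k, simplify = FALSE)),
  recursive = FALSE
)

# Fit on the training window, forecast the holdout window, record the MAE
holdout_mae <- sapply(candidates, function(vars) {
  fit <- auto.arima(y_train, xreg = X[train_idx, vars, drop = FALSE])
  fc  <- forecast(fit, xreg = X[test_idx, vars, drop = FALSE])
  mean(abs(y_test - fc$mean))
})
names(holdout_mae) <- sapply(candidates, paste, collapse = " + ")

# Benchmark: the same procedure with no regressors at all
fit0         <- auto.arima(y_train)
baseline_mae <- mean(abs(y_test - forecast(fit0, h = h)$mean))

best <- names(which.min(holdout_mae))
print(sort(holdout_mae))
cat("Baseline MAE without regressors:", baseline_mae, "\n")
cat("Best regressor set on the holdout:", best, "\n")
```

A regressor set is only worth keeping if it beats the no-regressor benchmark on the holdout. Note that this sketch feeds the actual holdout values of the regressors into `forecast()`; in a real forecast those future values are unknown and would themselves have to be forecast or lagged.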

That said, I would expect readership numbers of different Wikipedia articles to be correlated, especially if used as a proxy for "has a lot of time on his hands". So you might want to look at dimension reduction techniques, like principal components analysis (PCA) or similar, to reduce your regressors to only the first few principal components. Fewer orthogonal regressors will yield a more stable model and probably better forecasts. (The problem is that interpretability suffers.)
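A minimal sketch of the PCA step in base R, again on simulated data with hypothetical regressor names. The 90% variance threshold is an assumption made for illustration, not a recommendation from the answer. The components are estimated on the training window only and then applied to the holdout window, so the holdout data do not leak into the fit.

```r
set.seed(1)
n <- 96   # simulated monthly observations
h <- 12   # holdout length (an arbitrary choice)
train_idx <- 1:(n - h)
test_idx  <- (n - h + 1):n

# Hypothetical page-view series sharing a common driver, hence correlated
common <- rnorm(n)
X <- cbind(
  views_a = common + rnorm(n, sd = 0.3),
  views_b = common + rnorm(n, sd = 0.3),
  views_c = common + rnorm(n, sd = 0.3),
  views_d = common + rnorm(n, sd = 0.3)
)

# Estimate the principal components on the training rows only
pca <- prcomp(X[train_idx, ], center = TRUE, scale. = TRUE)

# Keep the smallest number of components explaining at least 90% of the variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
k <- which(cumsum(var_explained) >= 0.90)[1]

# Component scores for the training window, and the holdout rows
# projected with the training-window centering, scaling and loadings
pcs_train <- pca$x[, seq_len(k), drop = FALSE]
pcs_test  <- predict(pca, newdata = X[test_idx, ])[, seq_len(k), drop = FALSE]

print(round(var_explained, 3))
cat("Components kept:", k, "\n")
```

The matrices `pcs_train` and `pcs_test` can then be passed as `xreg` to `auto.arima()` and `forecast()` in place of the raw regressors, and the number of components kept can itself be chosen by holdout error.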
