Solved – How to properly do variable and model selection using auto.arima function

arimamodel selectionrtime series

I'm fitting ARIMA models to two different data sets (different metrics of fish abundance and distribution from two different sites) to see which model orders and covariates best describe the data from each site and would be good to forecast.

To do so, I'm using the auto.arima function. I'm running auto.arima with different combinations of covariates and looking at the AICc. I fixed d=1 so I know the input data is always the same, thus enabling to compare models using AICc.

The orders of the ARIMA output are typically different depending on the covariate(s) I include. Am I doing this right? Should I just fix the orders p, d and q of the ARIMA and then evaluate the different combinations of covariates.

Or am I totally wrong and I should just run auto.arima() with all the possible covariates in xreg and see what comes out? I tried this and I got a coefficient for each variable but I'm not sure if that means all variables are important or if auto.arima is forcing the variables to be included in the final model.

Best Answer

auto.arima()'s help page does not say whether or how xreg regressors are selected. I strongly suspect that xreg regressors are always included, whether or not they improve AICc or fit or are significant or not. Running

auto.arima(rnorm(100),xreg=rnorm(100))

multiple times confirms this - the regressor is always included, even if it is, like here, utterly unrelated to the time series.

It's not surprising that your ARIMA model changes if you include different regressors. After all, auto.arima() fits a regression with ARIMA errors (note: this is not an ARIMAX model!), so if you include different regressors, you feed a different residual time series to the ARIMA model. So getting different ARIMA orders makes perfect sense.

I'd propose that you actually regress your time series on the regressors, then look at the time series of residuals - this is what is fitted using ARIMA. Look also at ACF/PACF plots. Do the fitted ARIMA orders make sense for the residual time series? Do they make biological sense? For instance, if your model is well-specified, there may not be a good reason for moving average terms.

So selecting ARIMA orders based on AICc makes sense, but you will need to do something else to select your regressors. This will depend on what you are actually interested in. If you are interested in inference, then you should have prespecified your model and not be fitting different models, anyway. Stepwise methods for selecting covariates, whether based on AIC or significance, will render p-values invalid.

If you have enough data and are interested in prediction, you could do a forecasting test. Hold out the last couple of data points. Run auto.arima() on the rest, using your subset of regressors. Then forecast into the holdout sample. Do this for different choices of regressor sets, and check which one gives the lowest mean squared error. It is said that "the proof of the pudding is in the eating", and I firmly believe that "the proof of the model is in the prediction".