Solved – How to properly do variable and model selection using auto.arima function

arimamodel selectionrtime series

I'm fitting ARIMA models to two different data sets (different metrics of fish abundance and distribution from two different sites) to see which model orders and covariates best describe the data from each site and would be good to forecast.

To do so, I'm using the auto.arima function. I'm running auto.arima with different combinations of covariates and looking at the AICc. I fixed d=1 so I know the input data is always the same, thus enabling to compare models using AICc.

The orders of the ARIMA output are typically different depending on the covariate(s) I include. Am I doing this right? Should I just fix the orders p, d and q of the ARIMA and then evaluate the different combinations of covariates.

Or am I totally wrong and I should just run auto.arima() with all the possible covariates in xreg and see what comes out? I tried this and I got a coefficient for each variable but I'm not sure if that means all variables are important or if auto.arima is forcing the variables to be included in the final model.

Best Answer

auto.arima()'s help page does not say whether or how xreg regressors are selected. I strongly suspect that xreg regressors are always included, whether or not they improve AICc or fit or are significant or not. Running

auto.arima(rnorm(100),xreg=rnorm(100))

multiple times confirms this - the regressor is always included, even if it is, like here, utterly unrelated to the time series.

It's not surprising that your ARIMA model changes if you include different regressors. After all, auto.arima() fits a regression with ARIMA errors (note: this is not an ARIMAX model!), so if you include different regressors, you feed a different residual time series to the ARIMA model. So getting different ARIMA orders makes perfect sense.

I'd propose that you actually regress your time series on the regressors, then look at the time series of residuals - this is what is fitted using ARIMA. Look also at ACF/PACF plots. Do the fitted ARIMA orders make sense for the residual time series? Do they make biological sense? For instance, if your model is well-specified, there may not be a good reason for moving average terms.

So selecting ARIMA orders based on AICc makes sense, but you will need to do something else to select your regressors. This will depend on what you are actually interested in. If you are interested in inference, then you should have prespecified your model and not be fitting different models, anyway. Stepwise methods for selecting covariates, whether based on AIC or significance, will render p-values invalid.

If you have enough data and are interested in prediction, you could do a forecasting test. Hold out the last couple of data points. Run auto.arima() on the rest, using your subset of regressors. Then forecast into the holdout sample. Do this for different choices of regressor sets, and check which one gives the lowest mean squared error. It is said that "the proof of the pudding is in the eating", and I firmly believe that "the proof of the model is in the prediction".

Related Solutions

Solved – Variable selection and forecasting with regARIMA models

The regression coefficients are optimized along with the ARMA coefficients. So you will not get the same answer if you fit the models separately.
You probably read that selecting variables for prediction is not the same as testing their significance. In other words, if prediction is your goal, then doing tests will not necessarily give you the best results. Alternatives are the AIC, or some form of cross-validation or validation sample. By default, auto.arima() uses the AICc for the ARMA error (but does no variable selection on the regression variables -- it uses all you provide).
Yes, the prediction is the sum of the regression models and the ARMA error. This is discussed, for example, in chapter 9 of my new book.

ARIMA – Should Auto.arima in R Report Models with Higher AIC, AICC and BIC

auto.arima uses some approximations in order to speed up the processing. The final model is fitted using full MLE, but along the way the models are estimated using CSS unless you use the argument approximation=FALSE. This is explained in the help file:

approximation If TRUE, estimation is via conditional sums of squares and the information criteria used for model selection are approximated. The final model is still computed using maximum likelihood estimation. Approximation should be used for long time series or a high seasonal period to avoid excessive computation times.

The default setting is approximation=(length(x)>100 | frequency(x)>12), again this is specified in the help file. As you have 17544 observations, the default setting gives approximation=TRUE.

Using the approximations, the best model found was a regression with ARIMA(5,1,0) errors with AICc of 2989.33. If you turn the approximations off, the best model has ARIMA(2,1,1) errors with an AICc of 2361.40.

> fitauto = auto.arima(reprots[,"lnwocone"], approximation=FALSE,
                xreg=cbind(fourier(reprots[,"lnwocone"], K=11),
                reprots[,c("temp","sqt","humidity","windspeed","mist","rain")]),
                start.p=1, start.q=1, trace=TRUE, seasonal=FALSE)
> fitauto
Series: reprots[, "lnwocone"] 
ARIMA(2,1,1) with drift         
...
sigma^2 estimated as 0.08012:  log likelihood=-1147.63
AIC=2361.27   AICc=2361.4   BIC=2617.76

Best Answer

Related Solutions

Solved – Variable selection and forecasting with regARIMA models

ARIMA – Should Auto.arima in R Report Models with Higher AIC, AICC and BIC

Related Question