I split my data into a training set, on which I fit several ARIMA models, and a test set, on which I assess their performance (in R). As I understand it, I can pick the best model by choosing the one with the smallest AICc, but only among models with the same differencing order. Alternatively, I can choose by RMSE, for which different differencing orders do not matter. In any case, all my models have d=1.

If small AICc values tend to indicate better models, and a smaller RMSE also indicates a better model, shouldn't the models with the smallest AICc also have the smallest RMSE? In my case, the models with smaller AICc have larger RMSE than the models with larger AICc. How should I decide which model is best?
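For reference, the two criteria measure different things, which is one reason their rankings need not agree. Under the usual textbook definitions (an assumption here, since software differs in how it counts $k$ and $n$), let $\hat{L}$ be the maximized likelihood on the training set, $k$ the number of estimated parameters, $n$ the number of observations used in the likelihood, $T$ the length of the training set, and $h$ the number of test observations:

$$
\mathrm{AICc} = -2\log\hat{L} + 2k + \frac{2k(k+1)}{n-k-1},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{h}\sum_{t=1}^{h}\left(y_{T+t}-\hat{y}_{T+t}\right)^{2}}
$$

The AICc is computed entirely from the training fit, with a penalty for the number of parameters, whereas the RMSE, when computed on the test set, depends only on the forecast errors for the held-out observations.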

Here are the different ARIMA models with their AICc, the p-value of the Ljung-Box test on the residuals, the RMSE and the MAPE.

```
Model          AICc     p-value   RMSE     MAPE
ARIMA(2,1,2)   515.28   0.07054   1.1537   13.812
ARIMA(2,1,1)   517.91   0.1145    1.0441   13.925
ARIMA(1,1,2)   517.9    0.1169    1.0667   14.217
ARIMA(1,1,1)   516.22   0.1732    1.1122   14.848
ARIMA(2,1,0)   537.3    0.0074    0.9066   12.083
ARIMA(0,1,2)   519.59   0.1004    0.9431   12.676
ARIMA(0,1,1)   537.5    0.0007    0.9030   12.006
ARIMA(1,1,0)   544.32   0.0006    0.8961   11.735
ARIMA(0,1,0)   549.08   0.0006    0.8963   11.747
ARIMA(3,1,2)   521.84   0.0368    1.0181   13.527
ARIMA(2,1,3)   521.6    0.0432    1.0275   13.632
ARIMA(3,1,3)   511.6    0.1617    1.0945   14.699
ARIMA(3,1,1)   519.91   0.0800    1.1116   14.815
ARIMA(1,1,3)   519.78   0.05345   0.9913   13.191
```

I should add that auto.arima() with stepwise=FALSE, approximation=FALSE and seasonal=FALSE chose ARIMA(2,1,2), but it produces NaNs.

Should I start by rejecting the models whose p-value is below 0.05? How should I then decide on the best model? Which model would you choose given these values?
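A minimal sketch of that two-step screening, using only the values from the table above. It is written in plain Python purely to tabulate the numbers, not to refit anything; the 0.05 cutoff and the names `Fit`, `FITS` and `screen` are assumptions made for illustration, and the result is not a recommendation of either criterion.

```python
from typing import NamedTuple


class Fit(NamedTuple):
    """One row of the results table (hypothetical container)."""
    name: str
    aicc: float
    lb_pvalue: float  # Ljung-Box p-value on the residuals
    rmse: float
    mape: float


# Values copied from the table in the question.
FITS = [
    Fit("ARIMA(2,1,2)", 515.28, 0.07054, 1.1537, 13.812),
    Fit("ARIMA(2,1,1)", 517.91, 0.1145, 1.0441, 13.925),
    Fit("ARIMA(1,1,2)", 517.9, 0.1169, 1.0667, 14.217),
    Fit("ARIMA(1,1,1)", 516.22, 0.1732, 1.1122, 14.848),
    Fit("ARIMA(2,1,0)", 537.3, 0.0074, 0.9066, 12.083),
    Fit("ARIMA(0,1,2)", 519.59, 0.1004, 0.9431, 12.676),
    Fit("ARIMA(0,1,1)", 537.5, 0.0007, 0.9030, 12.006),
    Fit("ARIMA(1,1,0)", 544.32, 0.0006, 0.8961, 11.735),
    Fit("ARIMA(0,1,0)", 549.08, 0.0006, 0.8963, 11.747),
    Fit("ARIMA(3,1,2)", 521.84, 0.0368, 1.0181, 13.527),
    Fit("ARIMA(2,1,3)", 521.6, 0.0432, 1.0275, 13.632),
    Fit("ARIMA(3,1,3)", 511.6, 0.1617, 1.0945, 14.699),
    Fit("ARIMA(3,1,1)", 519.91, 0.0800, 1.1116, 14.815),
    Fit("ARIMA(1,1,3)", 519.78, 0.05345, 0.9913, 13.191),
]

ALPHA = 0.05  # assumed significance level for the Ljung-Box test


def screen(fits, alpha=ALPHA):
    """Keep models whose residuals do not fail the Ljung-Box test."""
    return [f for f in fits if f.lb_pvalue >= alpha]


kept = screen(FITS)
best_by_aicc = min(kept, key=lambda f: f.aicc)
best_by_rmse = min(kept, key=lambda f: f.rmse)

print(f"{len(kept)} of {len(FITS)} models pass the residual check")
print("lowest AICc among them:", best_by_aicc.name)
print("lowest RMSE among them:", best_by_rmse.name)
```

Even after the residual check, the two criteria pick different models here, which is exactly the conflict described above.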

## Best Answer

The AIC should be calculated from the residuals of a model that controls for intervention administration. Otherwise the intervention effects are treated as Gaussian noise, which underestimates the actual autoregressive effect and therefore misestimates the model parameters. That leads directly to an incorrect error sum of squares and ultimately to an incorrect AIC. Most SE responders do not point out this assumption when they promote simple descriptive statistics such as AIC and RMSE.

The quick answer is that you should use neither unless you are addressing the question of identifying and remedying the effects of unspecified deterministic/exogenous structure.

See @AdamO's insightful response to the question "Interrupted Time Series Analysis - ARIMAX for High Frequency Biological Data?":

"The correlogram should be calculated from residuals using a model that controls for intervention administration, otherwise the intervention effects are taken to be Gaussian noise, underestimating the actual autoregressive effect."