auto.arima()'s help page does not say whether or how xreg regressors are selected. I strongly suspect that xreg regressors are always included, whether or not they improve the AICc or the fit, and whether or not they are significant. Running

    auto.arima(rnorm(100), xreg = rnorm(100))

multiple times confirms this: the regressor is always included, even when it is, as here, utterly unrelated to the time series.
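A minimal sketch of that check (assuming the forecast package; the seed and the number of replications are arbitrary):

    library(forecast)

    set.seed(1)
    # Fit auto.arima() repeatedly with a pure-noise regressor that is
    # unrelated to the (also pure-noise) response, and record whether
    # the regressor's coefficient ends up in the chosen model.
    kept <- replicate(10, {
      y <- rnorm(100)
      x <- cbind(noise = rnorm(100))   # named column -> coefficient called "noise"
      fit <- auto.arima(y, xreg = x)
      "noise" %in% names(coef(fit))
    })
    table(kept)   # the claim above is that the regressor is retained in every fit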
It's not surprising that your ARIMA model changes when you include different regressors. After all, auto.arima() fits a regression with ARIMA errors (note: this is not an ARIMAX model!), so different regressors mean a different residual time series being fed to the ARIMA part. Getting different ARIMA orders therefore makes perfect sense.
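For concreteness, one compact way to write the distinction (a sketch in backshift notation, ignoring differencing):

    Regression with ARMA errors (what auto.arima() fits when you supply xreg):
        y_t = beta * x_t + n_t,   with  phi(B) n_t = theta(B) eps_t
    ARMAX:
        phi(B) y_t = beta * x_t + theta(B) eps_t

In the first form beta is the direct effect of x_t on y_t; in the ARMAX form the effect of x_t is filtered through the AR polynomial, so beta is much harder to interpret.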
I'd propose that you actually regress your time series on the regressors, then look at the time series of residuals: this is the series the ARIMA model is fitted to. Also look at its ACF/PACF plots. Do the fitted ARIMA orders make sense for the residual time series? Do they make biological sense? For instance, if your model is well specified, there may not be a good reason for moving average terms.
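A minimal sketch of that residual check (assuming the forecast package; y, x1, x2, your_data and the seasonal frequency are placeholders for your own variables):

    library(forecast)

    # Regress the outcome on the candidate regressors (placeholder names).
    fit_lm <- lm(y ~ x1 + x2, data = your_data)

    # The residual series is what the ARIMA part of the model has to explain.
    res <- ts(residuals(fit_lm), frequency = 12)   # set frequency to match your data

    # Plot the residual series together with its ACF and PACF.
    tsdisplay(res)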
So selecting ARIMA orders based on AICc makes sense, but you will need to do something else to select your regressors. This will depend on what you are actually interested in. If you are interested in inference, then you should have prespecified your model and not be fitting different models, anyway. Stepwise methods for selecting covariates, whether based on AIC or significance, will render p-values invalid.
If you have enough data and are interested in prediction, you could do a forecasting test. Hold out the last couple of data points. Run auto.arima() on the rest, using your subset of regressors. Then forecast into the holdout sample. Do this for different choices of regressor sets, and check which one gives the lowest mean squared error. It is said that "the proof of the pudding is in the eating", and I firmly believe that "the proof of the model is in the prediction".
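A minimal sketch of such a holdout comparison (assuming the forecast package; y, the regressor matrix X, the column names and the holdout length h are placeholders):

    library(forecast)

    h     <- 12                      # length of the holdout sample
    n     <- length(y)
    train <- 1:(n - h)
    test  <- (n - h + 1):n

    # Holdout mean squared error for a given set of regressor columns.
    holdout_mse <- function(cols) {
      fit <- auto.arima(y[train], xreg = X[train, cols, drop = FALSE])
      fc  <- forecast(fit, xreg = X[test, cols, drop = FALSE])
      mean((y[test] - fc$mean)^2)
    }

    # Compare candidate regressor sets; column names are purely illustrative.
    holdout_mse(c("temperature"))
    holdout_mse(c("temperature", "rainfall"))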
I'm not sure what the actual question is here. There seems to be some concern that p-values are "not significant" for the model with the lowest AIC. That is certainly possible, especially when you have missing data.
However, please try not to be too concerned with statistical significance. Since you are interested primarily in inference:
"I want to have an explanatory model for an outcome, possibly might progress to have a predictive model later on."
...then it is crucially important that you select the variables for your model in a principled way. Any stepwise procedure is a bad approach to this, and more generally so is any approach based solely on p-values. The model has absolutely no idea which of your variables is the main exposure, a competing exposure, a potential confounder, a collider or a mediator, so it cannot tell you anything about the way these variables are related causally. Moreover, when you adjust for mediators or colliders, or when you over-adjust for confounders, you can introduce severe bias into the estimates of primary concern. See the accepted answer here for further details on principled variable selection for inference:
How do DAGs help to reduce bias in causal inference?
Best Answer
You can use AICtweedie directly in MuMIn's functions; just specify it as the rank argument. Alternatively, you could write a wrapper around AICtweedie.
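A minimal sketch of the first option (assuming the tweedie and statmod packages; the formula, data and var.power are placeholders):

    library(MuMIn)
    library(statmod)   # tweedie() family for glm()
    library(tweedie)   # AICtweedie()

    # Hypothetical global Tweedie GLM; dredge() requires na.action = na.fail.
    full <- glm(y ~ x1 + x2 + x3,
                family = tweedie(var.power = 1.5, link.power = 0),
                data   = your_data,
                na.action = na.fail)

    # Pass AICtweedie as the rank argument so candidate models are
    # compared on the Tweedie AIC instead of the default AICc.
    ms <- dredge(full, rank = AICtweedie)
    head(ms)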