auto.arima()'s help page does not say whether or how xreg regressors are selected. I strongly suspect that xreg regressors are always included, whether or not they improve the AICc or the fit, and whether or not they are significant. Running

    auto.arima(rnorm(100), xreg = rnorm(100))

multiple times confirms this: the regressor is always included, even when it is, as here, utterly unrelated to the time series.
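A minimal sketch of that check (assuming the forecast package; the seed and the number of replications are arbitrary):

    library(forecast)

    set.seed(1)
    # Fit auto.arima() repeatedly with a pure-noise regressor that is
    # unrelated to the (also pure-noise) response, and record whether
    # the regressor's coefficient ends up in the chosen model.
    kept <- replicate(10, {
      y <- rnorm(100)
      x <- cbind(noise = rnorm(100))   # named column -> coefficient called "noise"
      fit <- auto.arima(y, xreg = x)
      "noise" %in% names(coef(fit))
    })
    table(kept)   # the claim above is that the regressor is retained in every fit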
It's not surprising that your ARIMA model changes when you include different regressors. After all, auto.arima() fits a regression with ARIMA errors (note: this is not an ARIMAX model!), so different regressors mean a different residual time series being fed to the ARIMA part. Getting different ARIMA orders therefore makes perfect sense.
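For concreteness, one compact way to write the distinction (a sketch in backshift notation, ignoring differencing):

    Regression with ARMA errors (what auto.arima() fits when you supply xreg):
        y_t = beta * x_t + n_t,   with  phi(B) n_t = theta(B) eps_t
    ARMAX:
        phi(B) y_t = beta * x_t + theta(B) eps_t

In the first form beta is the direct effect of x_t on y_t; in the ARMAX form the effect of x_t is filtered through the AR polynomial, so beta is much harder to interpret.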
I'd propose that you actually regress your time series on the regressors, then look at the time series of residuals: this is the series the ARIMA model is fitted to. Also look at its ACF/PACF plots. Do the fitted ARIMA orders make sense for the residual time series? Do they make biological sense? For instance, if your model is well specified, there may not be a good reason for moving average terms.
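A minimal sketch of that residual check (assuming the forecast package; y, x1, x2, your_data and the seasonal frequency are placeholders for your own variables):

    library(forecast)

    # Regress the outcome on the candidate regressors (placeholder names).
    fit_lm <- lm(y ~ x1 + x2, data = your_data)

    # The residual series is what the ARIMA part of the model has to explain.
    res <- ts(residuals(fit_lm), frequency = 12)   # set frequency to match your data

    # Plot the residual series together with its ACF and PACF.
    tsdisplay(res)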
So selecting ARIMA orders based on AICc makes sense, but you will need to do something else to select your regressors. This will depend on what you are actually interested in. If you are interested in inference, then you should have prespecified your model and not be fitting different models, anyway. Stepwise methods for selecting covariates, whether based on AIC or significance, will render p-values invalid.
If you have enough data and are interested in prediction, you could do a forecasting test. Hold out the last couple of data points. Run auto.arima() on the rest, using your subset of regressors. Then forecast into the holdout sample. Do this for different choices of regressor sets, and check which one gives the lowest mean squared error. It is said that "the proof of the pudding is in the eating", and I firmly believe that "the proof of the model is in the prediction".
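A minimal sketch of such a holdout comparison (assuming the forecast package; y, the regressor matrix X, the column names and the holdout length h are placeholders):

    library(forecast)

    h     <- 12                      # length of the holdout sample
    n     <- length(y)
    train <- 1:(n - h)
    test  <- (n - h + 1):n

    # Holdout mean squared error for a given set of regressor columns.
    holdout_mse <- function(cols) {
      fit <- auto.arima(y[train], xreg = X[train, cols, drop = FALSE])
      fc  <- forecast(fit, xreg = X[test, cols, drop = FALSE])
      mean((y[test] - fc$mean)^2)
    }

    # Compare candidate regressor sets; column names are purely illustrative.
    holdout_mse(c("temperature"))
    holdout_mse(c("temperature", "rainfall"))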
I'm not sure what the actual question is here. There seems to be some concern that p-values are "not significant" for the model with the lowest AIC. That is certainly possible, especially when you have missing data.
However, please try not to be too concerned with statistical significance. Since you are interested primarily in inference:
"I want to have an explanatory model for an outcome, possibly might progress to have a predictive model later on."
...then it is crucially important that you select the variables for your model in a principled way. Any stepwise procedure is a bad approach to this, and more generally so is any approach based solely on p-values. The model has absolutely no idea which of your variables is the main exposure, a competing exposure, a potential confounder, a collider or a mediator, so it cannot tell you anything about the way these variables are related causally. Moreover, when you adjust for mediators or colliders, or when you over-adjust for confounders, you can introduce severe bias into the estimates of primary concern. See the accepted answer here for further details on principled variable selection for inference:
How do DAGs help to reduce bias in causal inference?
Best Answer
You can use AICtweedie directly in MuMIn's functions; just specify it as the rank argument. Alternatively, you could write a wrapper around AICtweedie.
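A minimal sketch of the first option (assuming the tweedie and statmod packages; the formula, data and var.power are placeholders):

    library(MuMIn)
    library(statmod)   # tweedie() family for glm()
    library(tweedie)   # AICtweedie()

    # Hypothetical global Tweedie GLM; dredge() requires na.action = na.fail.
    full <- glm(y ~ x1 + x2 + x3,
                family = tweedie(var.power = 1.5, link.power = 0),
                data   = your_data,
                na.action = na.fail)

    # Pass AICtweedie as the rank argument so candidate models are
    # compared on the Tweedie AIC instead of the default AICc.
    ms <- dredge(full, rank = AICtweedie)
    head(ms)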