Gamma vs Tweedie distribution for large productivity dataset

Tags: aic, gamma-distribution, generalized-additive-model, residuals, tweedie-distribution

I'm running some GAMs using the mgcv R package on a dataset with ~8.5k observations, where productivity is the response and environmental conditions are the covariates. However, I am unsure which distribution to use and am seeking some advice.

The productivity response is definitely not normally distributed, which I think is the mgcv default if you don't specify anything? I've been using the Gamma on the logic that it reduces the residual variation:

family=Gamma(link=log)

Another option for productivity seems to be the Tweedie distribution, as biomass is a positive continuous quantity which can be very small or 0, e.g.:

family=Tweedie(1.25,power(.5))
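For anyone wanting to reproduce the setup, here is a minimal sketch of how the two family specifications above slot into a `gam()` call. The data are simulated and the names `dat`, `temp` and `prod` are made up for illustration; they are not from my dataset. It also confirms that the family is gaussian when none is specified.

```r
library(mgcv)

set.seed(1)
n <- 500
# Hypothetical data: 'temp' is a made-up covariate, 'prod' a made-up response
dat <- data.frame(temp = runif(n, 0, 10))
mu <- exp(0.5 + sin(dat$temp / 2))
dat$prod <- rgamma(n, shape = 2, rate = 2 / mu)  # strictly positive, mean mu

# No family argument: gam() falls back to gaussian with identity link
m_default <- gam(prod ~ s(temp), data = dat, method = "REML")
print(m_default$family$family)

# Gamma with log link, as in the question
m_gamma <- gam(prod ~ s(temp), family = Gamma(link = "log"),
               data = dat, method = "REML")

# Tweedie with fixed power p = 1.25 and a square-root link, as in the question
m_tweedie <- gam(prod ~ s(temp), family = Tweedie(p = 1.25, link = power(0.5)),
                 data = dat, method = "REML")
```

mgcv also provides `tw()`, which estimates the power parameter from the data rather than fixing it at 1.25.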

The Tweedie produces a higher deviance explained (by about 5-6%) and a lower AIC (by about 1500 units) than the Gamma. The residuals show less structure (i.e. more constant variance), but the total deviance is much higher (residual plots 1 = gamma, 2 = tweedie). I'm not sure which of these criteria is most important for choosing a distribution?
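The quantities being compared here can be pulled out of two fitted models roughly as follows. This is only a sketch on simulated data (`dat`, `x` and `y` are hypothetical names), so the numbers it prints say nothing about the real dataset.

```r
library(mgcv)

set.seed(2)
n <- 400
dat <- data.frame(x = runif(n))                 # hypothetical covariate
mu <- exp(1 + 2 * dat$x)
dat$y <- rgamma(n, shape = 3, rate = 3 / mu)    # hypothetical positive response

m_gamma   <- gam(y ~ s(x), family = Gamma(link = "log"),
                 data = dat, method = "REML")
m_tweedie <- gam(y ~ s(x), family = Tweedie(p = 1.25, link = power(0.5)),
                 data = dat, method = "REML")

# AIC for both fits in one table (lower is better)
aic_tab <- AIC(m_gamma, m_tweedie)
print(aic_tab)

# Proportion of deviance explained, as reported by summary()
dev_expl <- c(gamma   = summary(m_gamma)$dev.expl,
              tweedie = summary(m_tweedie)$dev.expl)
print(dev_expl)

# Residual deviance: each family defines its own deviance,
# so these two raw values are not on a common scale
res_dev <- c(gamma = deviance(m_gamma), tweedie = deviance(m_tweedie))
print(res_dev)
```

For the residual plots, `gam.check(m_gamma)` and `gam.check(m_tweedie)` produce the standard mgcv diagnostic panels.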

My logic was that reducing the residual structure is a better outcome for the key assumption of constant variance. Also the key results I'm interested in (comparing the relative contributions of different covariates) don't fundamentally change with the distribution, but the magnitudes of their effects on productivity are different.

Gamma residuals

Tweedie residuals

Best Answer

The question you need to ask yourself is whether your response variable takes 0 values (not whether it takes very small values). Normally, if you have 0s in your data, you shouldn't be able to fit a gamma distribution.
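A quick way to check this is sketched below on simulated data. The names `dat`, `x` and `y` are made up, and the response is drawn with mgcv's `rTweedie()` purely so that it contains exact zeros.

```r
library(mgcv)

set.seed(3)
n <- 300
dat <- data.frame(x = runif(n))                 # hypothetical covariate
# Simulated Tweedie response with 1 < p < 2, which includes exact zeros
dat$y <- rTweedie(mu = exp(-1 + dat$x), p = 1.5, phi = 1)

# Count the exact zeros in the response
n_zero <- sum(dat$y == 0)
cat("zeros in response:", n_zero, "of", n, "\n")

# The Gamma family rejects non-positive responses, so this fit should fail
gamma_fit <- tryCatch(
  gam(y ~ s(x), family = Gamma(link = "log"), data = dat),
  error = function(e) e
)
if (inherits(gamma_fit, "error")) {
  cat("Gamma fit failed:", conditionMessage(gamma_fit), "\n")
}

# A Tweedie family accepts the zeros; tw() estimates the power parameter
m_tw <- gam(y ~ s(x), family = tw(link = "log"), data = dat, method = "REML")
```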

I would then suggest trying the lognormal, gamma and inverse Gaussian (inverse normal) distributions, which are the most common distributions for positive data.
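As a rough sketch of what trying these three could look like in mgcv, assuming a strictly positive response (the data and the names `dat`, `x` and `y` are simulated and hypothetical): the lognormal is fitted here as a gaussian model on `log(y)`, so its AIC needs the Jacobian term `2 * sum(log(y))` added before it can be compared with models fitted to `y` itself.

```r
library(mgcv)

set.seed(4)
n <- 400
dat <- data.frame(x = runif(n))                 # hypothetical covariate
mu <- exp(1 + dat$x)
dat$y <- rgamma(n, shape = 4, rate = 4 / mu)    # hypothetical, strictly positive

stopifnot(all(dat$y > 0))  # all three candidates require y > 0

# Lognormal: gaussian model for log(y)
m_lnorm <- gam(log(y) ~ s(x), family = gaussian(), data = dat, method = "ML")
# Gamma and inverse Gaussian, both with log link
m_gamma <- gam(y ~ s(x), family = Gamma(link = "log"),
               data = dat, method = "ML")
m_invg  <- gam(y ~ s(x), family = inverse.gaussian(link = "log"),
               data = dat, method = "ML")

# Put the lognormal AIC on the scale of y by adding the Jacobian term
aic <- c(lognormal        = AIC(m_lnorm) + 2 * sum(log(dat$y)),
         gamma            = AIC(m_gamma),
         inverse_gaussian = AIC(m_invg))
print(sort(aic))
```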