Solved – Log-linked Gamma GLM vs log-linked Gaussian GLM vs log-transformed LM

gamma-distribution, generalized-linear-model, link-function, model-selection, r

From my results, it appears that the Gamma GLM meets most assumptions, but is it a worthwhile improvement over the log-transformed LM? Most of the literature I've found deals with Poisson or binomial GLMs. I found the article "Evaluation of Generalized Linear Model Assumptions Using Randomization" very useful, but it lacks the actual plots used to make a decision. Hopefully someone with experience can point me in the right direction.

I want to model the distribution of my response variable T, which is plotted below. As you can see, it is positively skewed:

[Histogram of T, showing positive skew]

I have two categorical factors to consider: METH and CASEPART.
Note that this study is mainly exploratory, essentially serving as a pilot study before theorizing a model and performing DoE around it.

I have the following models in R, with their diagnostic plots:

LM.LOG <- lm(log10(T) ~ factor(METH) + factor(CASEPART), 
             data=tdat)

[Diagnostic plots for LM.LOG]

GLM.GAMMA <- glm(T ~ factor(METH) * factor(CASEPART),
                 data = tdat, family = Gamma(link = "log"))

[Diagnostic plots for GLM.GAMMA]

GLM.GAUS <- glm(T ~ factor(METH) * factor(CASEPART),
                data = tdat, family = gaussian(link = "log"))

[Diagnostic plots for GLM.GAUS]
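
(For reference, diagnostics of this kind can be reproduced with base R's built-in plot methods; a minimal sketch, assuming the three fitted objects above:)

## Residuals-vs-fitted and normal Q-Q plots for each model
## (panels 1 and 2 of R's default lm/glm diagnostics).
par(mfrow = c(1, 2))
plot(LM.LOG,    which = 1:2)
plot(GLM.GAMMA, which = 1:2)
plot(GLM.GAUS,  which = 1:2)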

I also obtained the following p-values from Shapiro-Wilk tests on the residuals:

LM.LOG: 2.347e-11  
GLM.GAMMA: 0.6288  
GLM.GAUS:  0.6288  
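
(These can be reproduced along the following lines; a sketch, assuming the fitted objects above. The exact residual type used isn't stated, so deviance residuals are shown for the GLMs:)

## Shapiro-Wilk tests on the residuals of each model.
shapiro.test(residuals(LM.LOG))                         # ordinary LM residuals
shapiro.test(residuals(GLM.GAMMA, type = "deviance"))   # deviance residuals
shapiro.test(residuals(GLM.GAUS,  type = "deviance"))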

I also calculated AIC and BIC values but, if I understand correctly, they aren't directly comparable as they stand: the LM is fitted to log10(T) while the GLMs are fitted to T itself, so the likelihoods are on different scales.
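
(If a comparison is wanted anyway, adding the Jacobian of the log10 transform puts the LM's likelihood back on the scale of T; a rough sketch, assuming the objects above and T > 0 throughout:)

## AIC of the log10-scale LM, adjusted to the scale of T.
## If Z = log10(T), then f_T(t) = f_Z(log10 t) / (t * log(10)), so the
## log-likelihood on the T scale is the LM's log-likelihood minus
## sum(log(T * log(10))); AIC therefore shifts by twice that amount.
AIC(GLM.GAMMA)
AIC(GLM.GAUS)
AIC(LM.LOG) + 2 * sum(log(tdat$T * log(10)))   # now comparable with the GLM AICs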

Also, I noted the extreme values, but I cannot classify them as outliers as there is no clear "special cause".

Best Answer

Well, quite clearly the log-linked Gaussian fit is unsuitable: there's strong heteroskedasticity in the residuals. So let's take that one out of consideration.

What's left is lognormal vs gamma.

Note that the histogram of $T$ is of no direct use, since the marginal distribution will be a mixture of variates (each conditioned on a different set of values for the predictors); even if one of the two models was correct, that plot may look nothing like the conditional distribution.
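
A toy illustration of that point (simulated data, nothing to do with the data in the question): even when every conditional distribution is Gamma, the pooled histogram can look quite unlike any single conditional.

## Three groups whose conditional distributions are all Gamma with the same
## shape but different means; the pooled (marginal) histogram is a mixture.
set.seed(1)
group_means <- c(2, 10, 40)   # arbitrary group means
y <- unlist(lapply(group_means,
                   function(m) rgamma(200, shape = 2, scale = m / 2)))
hist(y, breaks = 40, main = "Marginal histogram of a Gamma mixture")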

The two remaining models appear about equally suitable here. Both have variance proportional to the square of the mean, so the pattern of spread in the residuals-vs-fitted plots is similar.

A low outlier will fit slightly better with a gamma than a lognormal (and vice versa for a high outlier): at a given mean and variance, the lognormal is the more skewed of the two and has the heavier right tail.
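
The skewness claim is easy to check from the textbook moment formulas (a standalone sketch, not using the question's data):

## Match a Gamma and a lognormal on mean and variance and compare skewness.
## Both skewnesses depend only on the coefficient of variation cv = sd/mean:
##   Gamma:     skew = 2 * cv
##   lognormal: skew = 3 * cv + cv^3
m  <- 10; v <- 25            # arbitrary target mean and variance
cv <- sqrt(v) / m
c(gamma = 2 * cv, lognormal = 3 * cv + cv^3)   # the lognormal is always more skew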

One thing to remember is that the expectation of the lognormal is not $\exp(\mu)$; if you're interested in the mean, you can't just exponentiate the log-scale fit. Indeed, if the mean is what you care about, the gamma avoids a number of issues with the lognormal (e.g. once you incorporate the parameter uncertainty in $\sigma^2$ for the lognormal, prediction is based on the log-t distribution, which doesn't have a mean; prediction intervals still work fine, but that may be a problem for predicting the mean).
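
As a concrete illustration of the back-transformation point (a sketch assuming the fitted objects above; it uses the simple $\exp(\sigma^2/2)$ correction and so ignores the parameter-uncertainty issue just mentioned):

## Naively back-transforming the log10-scale fit gives a median-type estimate
## of T; the lognormal mean needs an extra exp(sigma^2 / 2) factor on the
## natural-log scale.  The Gamma GLM with a log link predicts the mean directly.
sigma_ln   <- sigma(LM.LOG) * log(10)               # residual SD on the natural-log scale
pred_naive <- 10^predict(LM.LOG)                    # back-transformed fit (median-type)
pred_lnorm <- pred_naive * exp(sigma_ln^2 / 2)      # crude lognormal mean estimate
pred_gamma <- predict(GLM.GAMMA, type = "response") # mean of T from the Gamma GLM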

See also here and here for some related discussions.