Solved – When to use gamma GLMs

Tags: gamma-distribution, generalized-linear-model

The gamma distribution can take on a wide range of shapes, and given the link between its mean and variance through its two parameters, it seems well suited to dealing with heteroskedasticity in non-negative data, in a way that log-transformed OLS can't without either WLS or some sort of heteroskedasticity-consistent VCV estimator.

I would use it more for routine non-negative data modeling, but I don't know anyone else who uses it, I haven't learned it in a formal classroom setting, and the literature I read never uses it. Whenever I Google something like "practical uses of gamma GLM", I get advice to use it for waiting times between Poisson events. OK. But that seems restrictive and can't be its only use.

Naively, the gamma GLM seems like a relatively assumption-light means of modeling non-negative data, given the gamma's flexibility. Of course you need to check Q-Q plots and residual plots, as with any model. But are there any serious drawbacks I'm missing, beyond communicating results to people who "just run OLS"?

Best Answer

The gamma has a property shared with the lognormal: when the shape parameter is held constant while the scale parameter is varied (as is usually done when using either for models), the variance is proportional to the square of the mean (constant coefficient of variation).
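That property follows directly from the gamma's moments; a minimal sketch (with an arbitrary, made-up shape value) that checks it numerically:

```python
import numpy as np

# For a gamma with shape k and scale theta, mean = k * theta and
# variance = k * theta**2, so variance / mean**2 = 1 / k: a constant
# squared coefficient of variation whenever the shape k is held fixed.
k = 4.0                            # fixed shape parameter (illustrative value)
for theta in (0.5, 2.0, 10.0):     # vary the scale freely
    mean = k * theta
    var = k * theta**2
    print(theta, var / mean**2)    # always 1 / k = 0.25
```

However the scale moves, the variance tracks the squared mean exactly, which is the heteroskedasticity pattern the question describes.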

Something approximating this occurs fairly often with financial data, and indeed with many other kinds of data.

As a result it's often suitable for data that are continuous, positive, and right-skew, and whose variance is near-constant on the log scale, though there are a number of other well-known (and often readily available) choices with those properties.

Further, it's common to fit a gamma GLM with a log link (the natural, i.e. inverse, link is used less often). What makes it slightly different from fitting a normal linear model to the logs of the data is that on the log scale the gamma is left-skew to varying degrees, while the normal (the log of a lognormal) is symmetric. This makes the gamma useful in a variety of situations.
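That skewness difference is easy to see by simulation; a sketch with assumed (made-up) parameter values:

```python
import numpy as np

# Compare sample skewness of gamma draws on the log scale with the log
# of lognormal draws. The log of a gamma is left-skewed (negative
# skewness); the log of a lognormal is exactly normal, hence symmetric.
def sample_skew(x):
    c = x - x.mean()
    return (c**3).mean() / (c**2).mean() ** 1.5

rng = np.random.default_rng(0)
log_gamma = np.log(rng.gamma(shape=2.0, scale=3.0, size=200_000))
log_lnorm = rng.normal(loc=1.0, scale=0.7, size=200_000)  # log of a lognormal

print(sample_skew(log_gamma))   # clearly negative (about -0.8 at shape 2)
print(sample_skew(log_lnorm))   # approximately zero
```

The smaller the gamma's shape parameter, the stronger the left skew on the log scale, so the further the model departs from log-scale normality.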

I've seen practical uses for gamma GLMs discussed (with real-data examples) in, off the top of my head, de Jong & Heller and Frees, as well as numerous papers; I've also seen applications in other areas. Oh, and if I remember right, Venables and Ripley's MASS uses it on school absenteeism (the quine data; edit: it's actually in the Statistics Complements to MASS, see p. 11, the 14th page of the pdf — it uses a log link, but with a small shift of the DV). And McCullagh and Nelder did a blood-clotting example, though that may have used the natural link.

Then there's Faraway's book, where he did a car insurance example and a semiconductor manufacturing example.

There are advantages and disadvantages to either choice (gamma GLM or lognormal regression). Since both are easy to fit these days, it's generally a matter of choosing what's most suitable.
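On "easy to fit": a log-link gamma GLM reduces to a particularly simple IRLS scheme, because the working weights mu**2 / V(mu) equal 1 under the gamma variance function V(mu) = mu**2. A minimal sketch on simulated data (all parameter values made up for illustration; in practice you'd just call glm() in R or a GLM routine in statsmodels):

```python
import numpy as np

# IRLS for a log-link gamma GLM: with unit working weights, each
# iteration is just OLS on the working response z = eta + (y - mu) / mu.
rng = np.random.default_rng(42)
n, k = 5000, 5.0                            # sample size, gamma shape
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
beta_true = np.array([0.5, 1.2])
mu_true = np.exp(X @ beta_true)
y = rng.gamma(shape=k, scale=mu_true / k)   # mean mu_true, constant CV

beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]  # start from log-OLS
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                 # working response
    beta = np.linalg.lstsq(X, z, rcond=None)[0]

print(beta)   # close to beta_true = [0.5, 1.2]
```

Note the starting values from log-OLS have a biased intercept (E[log y] differs from log of E[y]), but the IRLS iterations converge to the gamma ML estimate regardless.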

It's far from the only option; for example, there are also inverse Gaussian GLMs, which are more skewed and heavier-tailed (and even more heteroskedastic) than either the gamma or the lognormal.
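The "even more heteroskedastic" part is visible in the variance functions: the gamma GLM has V(mu) = mu**2 while the inverse Gaussian has V(mu) = mu**3, so its variance grows faster with the mean. A tiny sketch (dispersion taken as 1, mean values made up):

```python
import numpy as np

# Variance as a function of the mean, dispersion phi assumed to be 1:
# gamma:            Var(y) = mu**2
# inverse Gaussian: Var(y) = mu**3
mu = np.array([1.0, 5.0, 25.0])
print(mu**2)   # gamma variance at each mean
print(mu**3)   # inverse Gaussian variance at each mean, larger by a factor mu
```

At a mean of 25 the inverse Gaussian variance is 25 times the gamma's, which is why it suits data whose spread blows up even faster as the mean grows.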

As for drawbacks: prediction intervals are harder to produce, some diagnostic displays are harder to interpret, computing expectations on the scale of the linear predictor (generally the log scale) is harder than for the equivalent lognormal model, and hypothesis tests and intervals are generally asymptotic. These are often relatively minor issues.

It has some advantages over log-link lognormal regression (taking logs and fitting an ordinary linear regression model); one is that mean prediction is easy: with a log link, the fitted mean on the original scale is just the exponential of the linear predictor, with no retransformation bias to correct.
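The contrast is that for log-scale OLS, simply exponentiating a fitted log-scale mean underestimates the original-scale mean: if log(y) ~ Normal(m, s**2), then E[y] = exp(m + s**2 / 2), not exp(m). A sketch with made-up parameters:

```python
import numpy as np

# Simulate lognormal data and compare the naive back-transformed
# log-mean with the bias-corrected version exp(m + s**2 / 2).
# A log-link gamma GLM models E[y] = exp(eta) directly, so its
# fitted values need no such correction.
rng = np.random.default_rng(1)
m, s = 2.0, 0.8
y = np.exp(rng.normal(m, s, size=500_000))       # lognormal sample

naive = np.exp(np.log(y).mean())                  # exp of log-scale mean
corrected = np.exp(np.log(y).mean() + np.log(y).var() / 2)
print(y.mean(), naive, corrected)                 # naive is too small
```

The naive estimate misses the mean by a factor of roughly exp(s**2 / 2), and in practice s**2 must itself be estimated, which is what makes mean prediction fiddlier in the lognormal setup.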