I dispute the assertions from several points of view:
i) While the canonical link may well be 'problematic', it's not immediately obvious that someone will be interested in that link - whereas, for example, the log-link in the Poisson is often both convenient and natural, and so people are often interested in that. Even so, in the Poisson case people do look at other link functions.
So we needn't restrict our consideration to the canonical link.
A 'problematic link' is not of itself an especially telling argument against negative binomial regression.
The log-link, for example, seems to be quite a reasonable choice in some negative binomial applications, such as cases where the data might be conditionally Poisson but there's heterogeneity in the Poisson rate; there the log link can be almost as interpretable as it is in the Poisson case.
By comparison, I use Gamma GLMs reasonably often, but I don't recall (textbook examples aside) ever having used its canonical link - I use the log-link almost always, since it's a more natural link to use for the kinds of problems I tend to work with.
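For what it's worth, specifying the link in R is just a matter of the family argument. A minimal sketch with simulated data, purely to show the interface:

set.seed(1)
dat <- data.frame(x = runif(100))
dat$y <- rgamma(100, shape = 2, rate = 2 / exp(1 + dat$x))  # gamma data with mean exp(1 + x)
m_log   <- glm(y ~ x, family = Gamma(link = "log"), data = dat)      # log link, my usual choice
m_canon <- glm(y ~ x, family = Gamma(link = "inverse"), data = dat)  # the canonical (inverse) link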
ii) "Little seems to have been made ... in applications" may have been just about true in 1989, but I don't think it stands now. [Even if it did stand now, that's not an argument that it's a poor model, only that it's not been widely used - which might happen for all manner of reasons.]
Negative binomial regression has become more widely used as it has become more widely available, and I see it used in applications much more often now. In R, for example, I make use of the functions in MASS that support it (and the corresponding book, Venables and Ripley's Modern Applied Statistics with S, uses negative binomial regression in some interesting applications), and I had used similar functionality in a few other packages even before I used it in R.
I would have used negative binomial regression more, and earlier, if it had been readily available to me; I expect the same is true of many people, so the argument that it was little used seems to be more one of opportunity.
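For example, a minimal negative binomial fit in R (glm.nb is in MASS; the data here are simulated, just to show the interface):

library(MASS)
set.seed(42)
d <- data.frame(x = runif(200))
d$y <- rnbinom(200, mu = exp(1 + d$x), size = 1.5)  # NB counts with a log-scale mean
fit <- glm.nb(y ~ x, data = d)                      # log link by default
summary(fit)                                        # reports the estimated theta as well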
While it's possible to avoid negative binomial regression (say, by using overdispersed Poisson models), and there are a number of situations where it really doesn't matter much what you do, there are various reasons why that's not entirely satisfactory.
For example, when my interest is more toward prediction intervals than estimates of coefficients, the fact that the coefficients don't change may not be an adequate reason to avoid the negative binomial.
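To illustrate that point (a sketch with simulated data; the interval calculation is deliberately rough and ignores parameter uncertainty):

library(MASS)
set.seed(1)
d <- data.frame(x = runif(300))
d$y <- rnbinom(300, mu = exp(1 + d$x), size = 2)   # overdispersed counts
qp <- glm(y ~ x, family = quasipoisson, data = d)  # overdispersed-Poisson route
nb <- glm.nb(y ~ x, data = d)                      # negative binomial route
cbind(quasi = coef(qp), negbin = coef(nb))         # point estimates: typically very close
# but only the NB specifies a full predictive distribution for a new observation:
mu0 <- predict(nb, newdata = data.frame(x = 0.5), type = "response")
qnbinom(c(0.025, 0.975), mu = mu0, size = nb$theta)  # rough 95% interval for a new count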
Of course there are still other choices that model the dispersion (such as the Conway-Maxwell-Poisson, which is the subject of the paper you mentioned); while those are certainly options, there are situations where I am quite happy that the negative binomial is a reasonably good 'fit' as a model for my problem.
Are all these uses and recommendations in error?
I really don't think so! If they were, it should have become reasonably clear by now. Indeed, if McCullagh and Nelder had continued to feel the same way, they had no lack of opportunity, nor any lack of forums in which to clarify the remaining issues. Nelder has passed away (2010), but McCullagh is apparently still around.
If that short passage in McCullagh and Nelder is all they have, I'd say that's a pretty weak argument.
What are the consequences of this problematic link?
I think the issue is mainly that the variance function and the link function are related rather than unrelated (unlike pretty much all the other main GLM families in popular use), which makes interpretation on the scale of the linear predictor less straightforward. (That's not to say it's the only issue, but I do think it's the main one for a practitioner.) It's not a big deal.
By way of comparison, I see Tweedie models being used much more widely in recent times, and I don't see people concerning themselves with the fact that $p$ appears both in the variance function and the canonical link (nor in most cases even worrying much about the canonical link).
None of this is to take anything away from Conway-Maxwell-Poisson models (the subject of the Sellers and Shmueli paper), which are also becoming more widely used -- I certainly don't wish to take part in a negative binomial vs COM-Poisson shooting match.
I simply don't see it as one-or-the-other, any more than (now speaking more widely) I take a purely Bayesian or purely frequentist stance on statistical problems. I'll use whatever strikes me as the best choice in the particular circumstances I am in, and each choice tends to have its advantages and disadvantages.
I was hoping someone else would expand upon my comments in a full answer, but here is mine, in case anyone needs help with a similar question.
1. Response to what does it predict
A GLM, like ordinary regression, predicts the mean value of the response given the independent variables. Selecting type = "response" in the predict function back-transforms the prediction from the link scale (i.e., applies the inverse link), so the prediction is on the same scale as the dependent variable. If you overfit the model you can get back the exact values of y, but when modeling you're trying to generalize and approximate. The prediction may not be the same as the arithmetic mean, because the mean is estimated using maximum likelihood and is conditional on the included independent variables.
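A small illustration of the two scales (simulated data, just to make it concrete):

set.seed(1)
d <- data.frame(x = runif(100))
d$y <- rgamma(100, shape = 2, rate = 2 / exp(1 + d$x))  # gamma data with mean exp(1 + x)
model <- glm(y ~ x, family = Gamma(link = "log"), data = d)
head(predict(model, type = "link"), 3)      # predicted means on the log (link) scale
head(predict(model, type = "response"), 3)  # inverse link applied: same scale as y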
2. Response to how to "know" if the gamma distribution is right for my data
The gamma distribution is very flexible; it is really a family of distributions whose shape changes with its parameters (the variance changes with the mean). The gamma distribution requires the data to be strictly positive and continuous; if your data meet those conditions, a gamma model will quite often fit. Additionally, the Pearson residuals should be roughly normally distributed and show no signs of heteroscedasticity. For more on model evaluation see: https://www.crcpress.com/Extending-the-Linear-Model-with-R-Generalized-Linear-Mixed-Effects-and/Faraway/p/book/9781498720960
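Assuming a fitted glm object called model (as in the sketch above), the usual residual checks might look like this:

r <- residuals(model, type = "pearson")  # Pearson residuals
qqnorm(r); qqline(r)                     # roughly normal?
plot(fitted(model), r,
     xlab = "fitted values", ylab = "Pearson residuals")  # look for fanning/heteroscedasticity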
3. Response to what happens when I use the qgamma function?
I am not 100% sure why the person you borrowed your sample code from is using the qgamma function. However, it looks to me like they are trying to return the 90% confidence limits. This is the code I use to return 95% confidence limits on predictions:
preds_link <- predict(model, newdata = test_data,
                      type = "link",
                      se.fit = TRUE)  # predictions on the link scale, with standard errors
critval <- 1.96  # critical value for an approximate 95% CI
upper_ci_link <- preds_link$fit + (critval * preds_link$se.fit)  # upper CI on the link scale
lwr_ci_link   <- preds_link$fit - (critval * preds_link$se.fit)  # lower CI on the link scale
fit_link <- preds_link$fit  # fitted values on the link scale
upper_ci <- model$family$linkinv(upper_ci_link)  # back-transform to the response scale
lwr_ci   <- model$family$linkinv(lwr_ci_link)
fit      <- model$family$linkinv(fit_link)       # note: invert the link of fit_link
preds <- data.frame(fit, lwr_ci, upper_ci,
                    fit_link, lwr_ci_link,
                    upper_ci_link)  # predictions and CIs in a single data frame
colnames(preds) <- c("Prediction", "LCL", "UCL",
                     "Link_Prediction", "Link_LCL", "Link_UCL")  # descriptive names
Prediction, LCL, and UCL: prediction and confidence limits on the response scale (i.e., the same scale as the data)
Link_Prediction, Link_LCL, and Link_UCL: prediction and confidence limits on the link scale (here, log-transformed)
If you wish to return 90% confidence limits instead, just change the critical value to the one corresponding to $\alpha = 0.10$ instead of $\alpha = 0.05$.
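Rather than hard-coding 1.96, the critical value can be computed directly:

critval <- qnorm(1 - 0.10/2)  # about 1.645 for 90% limits; qnorm(1 - 0.05/2) gives about 1.96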
Best Answer
I'm going to take a guess at where the problem you're having has arisen, and you can correct me if I have it wrong.
You define a fitted negative binomial, and that part all makes sense. Then comes your formula involving the Gamma function, and this is where I think your confusion arises: the Gamma* distribution and the Gamma function are different things (though there's a connection between them!).
* Before anyone jumps on my bad English, I'm using "Gamma" function for $\Gamma(.)$ to distinguish it from the "gamma" function, $\gamma(.)$, and by extension, the same convention for the distribution, which is also often denoted $\Gamma$. (This seems to be fairly common usage, even if it doesn't strictly follow English rules, I think it's necessary to reduce potential confusion, especially in this question, where there's already confusion over different meanings of the word.)
The Gamma function is a function of a single argument as in your formula above.
The usual form of the Gamma distribution has two parameters. There's no inconsistency there, they're quite different objects.
The Gamma function:
$$\Gamma(t)=\int_0^{\infty} x^{t-1}e^{-x}\,\mathrm{d}x.$$
Note that this is just a function (though one that arises frequently).
Two related functions are the upper incomplete Gamma function and the lower incomplete gamma function:
$\Gamma(s,x)=\int_x^{\infty} t^{s-1}e^{-t}\,\mathrm{d}t$, and
$\gamma(s,x)=\int_0^{x} t^{s-1}e^{-t}\,\mathrm{d}t$,
where $\Gamma(t) = \lim_{x\rightarrow\infty} \gamma(t,x)$.
The Gamma distribution
Consider constructing a cdf as follows:
$$F(x;k) = \gamma (k,x)/\Gamma(k)$$
This has the value $0$ at $x=0$ and approaches $1$ as $x\rightarrow\infty$.
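Incidentally, this normalized incomplete gamma function is exactly what R's pgamma computes, so you can check the construction directly:

pgamma(2, shape = 3)  # F(2; 3) = gamma(3, 2) / Gamma(3), the cdf at x = 2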
The density can be obtained by differentiation; by the fundamental theorem of calculus, $\frac{d}{dx}\,\gamma(k,x) = x^{k-1}e^{-x}$, giving:
$$f(x;k)=\frac{x^{k-1}e^{-x}}{\Gamma(k)}\quad\text{for }x>0\text{ and }k>0.$$
So far this only has one parameter, the shape. We get the second parameter by adding a scaling factor; if $Z$ has the above one-parameter-Gamma form, let $X=\lambda Z$, and $X$ will be a two-parameter Gamma.
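You can see that scaling at work numerically; if $Z$ has the one-parameter form and $X=\lambda Z$, the two ways of computing $P(X\leq 4)$ agree:

lambda <- 2
pgamma(4, shape = 3, scale = lambda)  # P(X <= 4) using the scale parameterization
pgamma(4 / lambda, shape = 3)         # P(Z <= 4/lambda), identical by construction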
Beware: there are two common forms of the two-parameter Gamma distribution*, the 'rate form' and the 'scale form'. Both forms are given at the Wikipedia page on the Gamma distribution.
* There's a third, slightly less common parameterization of the two-parameter Gamma, used with GLMs: the shape-mean form.
Here's the scale form for the density:
$$f(x;k,\theta)=\frac{x^{k-1}e^{-x/\theta}}{\theta^{k}\Gamma(k)}\quad\text{for }x>0\text{ and }k,\theta>0.$$
Here's the rate form:
$$f(x;k,\beta)=\frac{\beta^{k}x^{k-1}e^{-x\beta}}{\Gamma(k)}\quad\text{for }x>0\text{ and }k,\beta>0.$$
Consider a Gamma density with shape parameter 3 (and unit scale or rate).
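You can draw it in R with, for example:

curve(dgamma(x, shape = 3), from = 0, to = 12,
      ylab = "density", main = "Gamma density, shape = 3, rate = 1")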
When you use functions in a program for the cdf, pdf, quantile function and random number generation, you have to make sure you're using the same convention (scale or rate) as the program!
In R, for example, one can do things with the Gamma distribution via functions like dgamma, pgamma, qgamma and rgamma (which by default use the shape-and-rate parameterization; you need to use a named argument to get the scale parameterization). On the other hand, for the Gamma function you'd use a call to gamma (one argument).
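A quick illustration of the distinction:

dgamma(2, shape = 3, rate = 1)   # density of the Gamma *distribution* at x = 2
dgamma(2, shape = 3, scale = 1)  # same value here, since scale = 1/rate = 1
gamma(3)                         # the Gamma *function*: Gamma(3) = 2! = 2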