Solved – Selecting Link Function for Negative Binomial GLM

Tags: count-data, generalized-linear-model, link-function, negative-binomial-distribution, r

I'm trying to model insect abundance data with a variety of vegetation- and site-related covariates. Because the counts are overdispersed, I've decided to use the negative binomial distribution. At first I was under the misapprehension that the negative binomial was itself the link function, but when fitting with glm.nb I'm prompted to select a link function, and the options are limited to log, sqrt, and identity.

I can't find a good explanation of 1) why those three are the only possibilities with glm.nb, or 2) how to decide which is most appropriate for my analysis. Comparing the fits with AICctab in R shows the log link is the best fit, though sqrt is almost indistinguishable. The plots for the identity link, however, look too good to be true (all points fall within the error bars, each treatment group is distinct, etc.). As far as I know, neither of these is a scientifically informed way to make the decision.

Other reading (e.g., this response) gives me the impression that I should match the link function to the response distribution and what I know of its properties. But neither the log nor the sqrt seems to match what I know about my distribution (it can't be negative and only yields integers). Yet the log function must match the negative binomial somehow, since it's the default link function for glm.nb.

Best Answer

First, you need to understand better what link functions are. Then, maybe look at what others are doing in your field, for instance this paper.
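As a reminder of what the link does, here is a sketch in standard GLM notation (the symbols $x_i$, $\beta$ and $\theta$ are notation assumed here, not taken from the question). The link $g$ acts on the mean $\mu_i$ of the response, not on the counts themselves:
$$Y_i \sim \text{NegBin}(\mu_i, \theta), \qquad g(\mu_i) = x_i^\top \beta, \qquad \text{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta}.$$
With the log link, $\mu_i = \exp(x_i^\top \beta) > 0$ for every value of the linear predictor, while $Y_i$ itself remains a non-negative integer because that property comes from the distribution, not from the link.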

Then, you have count data, and for such data the most natural link function is the log link. See, for example, the question "Goodness of fit and which model to choose linear regression or Poisson". So, unless you have very strong reasons otherwise, you should start out with the log link.
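A minimal sketch of that starting point, on simulated data rather than the asker's (the variable names `abundance` and `veg_cover`, the sample size and the true coefficients are all made up for illustration). It fits glm.nb from MASS with the log link, reads the coefficients as multiplicative effects on the mean, and fits the sqrt link alongside for an AIC comparison:

```r
library(MASS)  # provides glm.nb() and rnegbin()

set.seed(1)

# Hypothetical data: 200 sites, one vegetation covariate,
# and a mean that is log-linear in that covariate.
n <- 200
veg_cover <- runif(n, 0, 1)
mu <- exp(0.5 + 1.2 * veg_cover)
abundance <- rnegbin(n, mu = mu, theta = 2)
dat <- data.frame(abundance, veg_cover)

# Log link (the glm.nb default) and sqrt link for comparison
fit_log  <- glm.nb(abundance ~ veg_cover, data = dat, link = log)
fit_sqrt <- glm.nb(abundance ~ veg_cover, data = dat, link = sqrt)

# With the log link, exponentiated coefficients are ratios of expected counts
rate_ratio <- exp(coef(fit_log))
print(rate_ratio)

# Information-criterion comparison of the two links
aic_table <- AIC(fit_log, fit_sqrt)
print(aic_table)

# Fitted means under the log link are positive by construction
print(range(fitted(fit_log)))
```

An AIC table like this only ranks the candidate links on the data at hand; whether effects are plausibly multiplicative (log) or additive (identity) on the scale of the mean is still a subject-matter judgement.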

Related Solutions

Solved – Calculation of canonical link function in GLM

The variance function for the Bernoulli variable is $V(\mu) = \mu(1-\mu)$. We easily check that with the canonical link $g(\mu) = \log \frac{\mu}{1-\mu} = \log \mu - \log(1-\mu)$ then $$g'(\mu) = \frac{1}{\mu} + \frac{1}{1-\mu} = \frac{1 - \mu + \mu}{\mu(1-\mu)} = \frac{1}{\mu(1-\mu)} = \frac{1}{V(\mu)}.$$

For the general case one derives from the definition that $$E(Y) = \mu = b'(\theta) \quad \text{ and } \quad \text{Var}(Y) = b''(\theta) a(\psi),$$ see e.g. pages 28-29 in McCullagh and Nelder. With $g$ the canonical link we have $\theta = g(\mu) = g(b'(\theta))$, and the variance function is defined as $b''(\theta)$, which in terms of $\mu$ becomes $$V(\mu) = b''(g(\mu)).$$ By differentiation of the identity $\theta = g(b'(\theta))$ we get $$1 = g'(b'(\theta)) b''(\theta) = g'(\mu) V(\mu),$$ which gives the general relation between the canonical link function and the variance function.
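Two worked checks of the relation $g'(\mu) V(\mu) = 1$, added as illustrations (the negative binomial case assumes a fixed dispersion parameter, written $k$ here, with $V(\mu) = \mu + \mu^2/k$). For the Poisson distribution, $V(\mu) = \mu$ and the canonical link is $g(\mu) = \log \mu$, so $g'(\mu) = 1/\mu = 1/V(\mu)$. For the negative binomial, the canonical link is $g(\mu) = \log \frac{\mu}{\mu + k}$, and $$g'(\mu) = \frac{1}{\mu} - \frac{1}{\mu + k} = \frac{k}{\mu(\mu + k)} = \frac{1}{\mu + \mu^2/k} = \frac{1}{V(\mu)}.$$ The log link is therefore canonical for the Poisson but not for the negative binomial.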

In the construction of quasi-likelihood functions it is natural to start with the relation between the mean and the variance, given in terms of the variance function $V$. In this context the anti-derivative of $V(\mu)^{-1}$ can be interpreted as a generalization of the link function, see, for instance, the definition of the (log) quasi-likelihood on page 325 (formula 9.3) in McCullagh and Nelder.
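As a worked instance, take the standard integral form of the quasi-likelihood, $Q(\mu; y) = \int_y^\mu \frac{y - t}{\sigma^2 V(t)}\, dt$, with the dispersion $\sigma^2$ set to 1 for simplicity (an assumption made only for this illustration). The variance function $V(t) = t$ then gives $$Q(\mu; y) = \int_y^\mu \frac{y - t}{t}\, dt = y \log \mu - \mu - (y \log y - y),$$ which is the Poisson log-likelihood in $\mu$ up to terms not depending on $\mu$. The anti-derivative of $1/V$, here $\log \mu$, appears as the coefficient of $y$, just as the canonical parameter does in an exponential family.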

Solved – Negative binomial regression in R allowing for correlation between dispersion & regression coefficients

I haven't found another R package which does this, but I have written code which, based on the maximum likelihood estimates of a model fitted with glm.nb, calculates the full variance covariance matrix using the observed information matrix.

Comparing to values from SAS this appears to match, but if anyone spots an error or finds that it does not match the variance covariance matrix from SAS or Stata, please add a comment to this answer.

glm.nb.cov <- function(mod) {
  #given a model fitted by glm.nb in MASS, this function returns a variance covariance matrix for the
  #regression coefficients and dispersion parameter, without assuming independence between these
  #note that the model must have been fitted with x=TRUE argument so that design matrix is available
  #note also that the second derivatives used below are those for the log link (the glm.nb default)

  #formulae based on p23-p24 of http://pointer.esalq.usp.br/departamentos/lce/arquivos/aulas/2011/LCE5868/OverdispersionBook.pdf
  #and http://www.math.mcgill.ca/~dstephens/523/Papers/Lawless-1987-CJS.pdf

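  #stop early if the design matrix was not stored in the fitted object
  if (is.null(mod$x)) stop("refit the model with x=TRUE so that the design matrix is available")
  #p is number of regression coefficients
  p <- dim(vcov(mod))[1]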

  #construct observed information matrix
  obsInfo <- array(0, dim=c(p+1, p+1))

  #first calculate top left part for regression coefficients
  for (i in 1:p) {
    for (j in 1:p) {
      obsInfo[i,j] <- sum( (1+mod$y/mod$theta)*mod$fitted.values*mod$x[,i]*mod$x[,j] / (1+mod$fitted.values/mod$theta)^2  )
    }
  }

  #information for dispersion parameter
  obsInfo[(p+1),(p+1)] <- -sum(trigamma(mod$theta+mod$y) - trigamma(mod$theta) -
                                 1/(mod$fitted.values+mod$theta) + (mod$theta+mod$y)/(mod$theta+mod$fitted.values)^2 - 
                                 1/(mod$fitted.values+mod$theta) + 1/mod$theta)

  #covariance between regression coefficients and dispersion
  for (i in 1:p) {
    obsInfo[(p+1),i] <- -sum(((mod$y-mod$fitted.values) * mod$fitted.values / ( (mod$theta+mod$fitted.values)^2 )) * mod$x[,i] )
    obsInfo[i,(p+1)] <- obsInfo[(p+1),i]
  }

  #return variance covariance matrix
  solve(obsInfo)
}
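
One way to cross-check the analytic matrix independently of SAS or Stata is to differentiate the log-likelihood numerically. The sketch below is an illustration, not part of the original answer: the simulated data, the helper name `negll` and the parameter values are made up, and the log link is assumed. It builds the negative log-likelihood in the regression coefficients and theta, evaluates its Hessian at the glm.nb estimates with `optimHess`, and inverts it.

```r
library(MASS)  # provides glm.nb() and rnegbin()

set.seed(42)

# Hypothetical data with one covariate and a log-linear mean
n <- 300
x1 <- rnorm(n)
y <- rnegbin(n, mu = exp(1 + 0.5 * x1), theta = 1.5)
dat <- data.frame(y, x1)

# x = TRUE stores the design matrix in the fitted object
fit <- glm.nb(y ~ x1, data = dat, x = TRUE)

# Negative log-likelihood in (beta, theta), assuming the log link
negll <- function(par, X, y) {
  p <- ncol(X)
  mu <- exp(drop(X %*% par[1:p]))
  -sum(dnbinom(y, size = par[p + 1], mu = mu, log = TRUE))
}

# Numerical Hessian of the negative log-likelihood at the estimates,
# i.e. a finite-difference version of the observed information matrix
par_hat <- c(coef(fit), theta = fit$theta)
H <- optimHess(par_hat, negll, X = fit$x, y = fit$y)
vc_numeric <- solve(H)
print(vc_numeric)

# If glm.nb.cov() from above has been defined in the session,
# report the largest absolute discrepancy between the two matrices
if (exists("glm.nb.cov")) print(max(abs(vc_numeric - glm.nb.cov(fit))))
```

Because `optimHess` uses finite differences, the two matrices should agree only approximately; a large discrepancy would point to a problem in one of the two calculations.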
