Generalized Linear Models – Variance Functions for Poisson and Negative Binomial Distributions

generalized linear modelpoisson-regressionvariance

I'm having some trouble understanding how the variance functions of the Poisson or negative binomial tie in to the standard errors on the coefficients. I'm mentioning 2 models because I'm not sure if things work the same in both of them.

Say both the Poisson and negative binomial models have $\beta_0$: intercept, $\beta_1$: sex (where 1 is female and 0 is male).

The variance function for the Poisson is $\sigma^2=\lambda$. The variance function for the negative binomial is $\sigma^2=\mu+\frac{1}{\theta}\mu^2$, where $\theta$ is the scale parameter.

What I understand is that you can use the variance function to infer the variance in the groups of your data. Of course, you have your own data, so you could just take the variance of it, but you can also see what the model predicts for the variance. For example, in the Poisson model, the mean of the "female" group is $e^{\beta_0+\beta_1}$, and let's say you have the estimated coefficients, so you can just solve for the mean, which is equal to the variance. Similarly, the variance of the male group would be $e^{\beta_0}$.

For the negative binomial, I believe it's the same, but using the variance function for the negative binomial, the variance of the "female" group would be estimated as $e^{\beta_0+\beta_1}+\frac{1}{\theta}\cdot (e^{\beta_0+\beta_1})^2$, where you know the scale parameter $\theta$.

Now, how does the variance function relate to the standard error of the coefficients (e.g. $\beta_1$)? I've been trying to figure this out for a while. Is it possible to calculate the standard errors from the R output (i.e. given the coefficients and the scale parameter for the negative binomial case)? For a Poisson model, what is $\lambda$ for the standard error of a coefficient, and what is $\theta$ for the standard error of a coefficient in the negative binomial?

Any help would be appreciated!

Best Answer

Is it possible to calculate the [coefficient] standard errors from the R output?

It depends on what you mean by "the R output." It's not conceptually different for generalized linear models or ordinary least squares models. Maybe it's simplest to start thinking about this in the context of ordinary least squares regression. Consider the example from the lm() help page:

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)

If you then type lm.D9 at the prompt, you get the coefficient estimates

lm.D9

# Call:
# lm(formula = weight ~ group)
# 
# Coefficients:
# (Intercept)     groupTrt  
#       5.032       -0.371

If that's all you have to work with you can't get the standard errors of the coefficients even in this simple case. The standard errors of the coefficients are the square roots of the diagonal elements of the coefficient variance-covariance matrix. As this page (among many others) shows, that matrix is $\hat \sigma^2 (X'X)^{-1}$, where $\hat \sigma^2 $ is the estimated error variance and $X$ is the design matrix that submits the data to the model. Even for an intercept-only model the standard error of the mean isn't $\sqrt {\hat \sigma^2}$ by itself, as one might infer from the question as stated. It's $\sqrt {\hat \sigma^2/n}$, where $n$ is the number of observations.

R models typically allow for a vcov() function to report that matrix:

vcov(lm.D9)
#             (Intercept)    groupTrt
# (Intercept)  0.04849583 -0.04849583
# groupTrt    -0.04849583  0.09699167

sqrt(diag(vcov(lm.D9)))
# (Intercept)    groupTrt 
#   0.2202177   0.3114349

For many model types (like this one) you can get more information simply by asking for a summary() that directly reports the standard errors along with the coefficient estimates (not shown).

The principle is the same for generalized linear models like Poisson that are fit by maximum likelihood estimation (MLE). As this answer says:

the estimated standard errors of the MLE are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. In other words: The square roots of the diagonal elements of the inverse of the Hessian (or the negative Hessian) are the estimated standard errors.

That is, the variance-covariance matrix of the coefficient estimates is the inverse of the matrix of second derivatives of the log-likelihood with respect to the covariate values at the final solution. For ordinary linear regression under the classic assumption of independent, normally distributed errors having constant variance, that ends up identical to the formula above. For other generalized linear models, however, there is not typically a closed-form solution and the entire variance-covariance matrix must be estimated from the combination of the model and the data.

A Poisson model assumes that the variance around any estimate equals the mean. So the formula for the likelihood used to fit the model, where it includes a parameter $\lambda$ for the mean, inherently also includes the variance $\lambda$. For the Poisson example of the glm() help page:

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
glm.D93
#
# Call:  glm(formula = counts ~ outcome + treatment, family = poisson())
# 
# Coefficients:
# (Intercept)     outcome2     outcome3   treatment2   treatment3  
#   3.045e+00   -4.543e-01   -2.930e-01    1.338e-15    1.421e-15  
# 
# Degrees of Freedom: 8 Total (i.e. Null);  4 Residual
# Null Deviance:        10.58 
# Residual Deviance: 5.129  AIC: 56.76
# 
sqrt(diag(vcov(glm.D93)))
# (Intercept)    outcome2    outcome3  treatment2  treatment3 
#   0.1708987   0.2021708   0.1927423   0.2000000   0.2000000

Again in this case, summary(glm.D93) will produce those values along with additional information about the model.

how does the variance function relate to the standard error of the coefficients?

In this Poisson case, it's mainly how the maximum-likelihood estimate $\hat \lambda$ of the mean also estimates the variance. For an intercept-only model with $n$ observations, this page shows that the standard error of the estimate of the mean is $\sqrt{\hat \lambda/n}$. This answer shows a similar intercept-only result for a negative binomial (in a different parameterization from yours). In both cases you have the inverse relationship of the coefficient standard error to $\sqrt{n}$.

For a single binary predictor and independent observations, as in your model, you essentially have 2 such intercept-only models in the Poisson case, or for the negative binomial when you pre-specify a known $\theta$. For more complicated regressions, in which you need to estimate the negative-binomial $\theta$ or have more predictors whose coefficients need to be estimated together, the standard errors also have to do with how precisely the model fits the data under the modeling assumptions.

Related Solutions

Regression – When Do Poisson and Negative Binomial Regressions Fit Same Coefficients?

You have discovered an intimate, but generic, property of GLMs fit by maximum likelihood. The result drops out once one considers the simplest case of all: Fitting a single parameter to a single observation!

One sentence answer: If all we care about is fitting separate means to disjoint subsets of our sample, then GLMs will always yield $\hat\mu_j = \bar y_j$ for each subset $j$, so the actual error structure and parametrization of the density both become irrelevant to the (point) estimation!

A bit more: Fitting orthogonal categorical factors by maximum likelihood is equivalent to fitting separate means to disjoint subsets of our sample, so this explains why Poisson and negative binomial GLMs yield the same parameter estimates. Indeed, the same happens whether we use Poisson, negbin, Gaussian, inverse Gaussian or Gamma regression (see below). In the Poisson and negbin case, the default link function is the $\log$ link, but that is a red herring; while this yields the same raw parameter estimates, we'll see below that this property really has nothing to do with the link function at all.

When we are interested in a parametrization with more structure, or that depends on continuous predictors, then the assumed error structure becomes relevant due to the mean-variance relationship of the distribution as it relates to the parameters and the nonlinear function used for modeling the conditional means.

GLMs and exponential dispersion families: Crash course

An exponential dispersion family in natural form is one such that the log density is of the form $$ \log f(y;\,\theta,\nu) = \frac{\theta y - b(\theta)}{\nu} + a(y,\nu) \>. $$

Here $\theta$ is the natural parameter and $\nu$ is the dispersion parameter. If $\nu$ were known, this would just be a standard one-parameter exponential family. All the GLMs considered below assume an error model from this family.

Consider a sample of a single observation from this family. If we fit $\theta$ by maximum likelihood, we get that $y = b'(\hat\theta)$, irrespective of the value of $\nu$. This readily extends to the case of an iid sample since the log likelihoods add, yielding $\bar y = b'(\hat\theta)$.

But, we also know, due to the nice regularity of the log density as a function of $\theta$, that $$ \frac{\partial}{\partial \theta} \mathbb E \log f(Y;\theta,\nu) = \mathbb E \frac{\partial}{\partial \theta} \log f(Y;\theta,\nu) = 0 \>. $$ So, in fact $b'(\theta) = \mathbb E Y = \mu$.

Since maximum likelihood estimates are invariant under transformations, this means that $ \bar y = \hat\mu $ for this family of densities.

Now, in a GLM, we model $\mu_i$ as $\mu_i = g^{-1}(\mathbf x_i^T \beta)$ where $g$ is the link function. But if $\mathbf x_i$ is a vector of all zeros except for a single 1 in position $j$, then $\mu_i = g(\beta_j)$. The likelihood of the GLM then factorizes according to the $\beta_j$'s and we proceed as above. This is precisely the case of orthogonal factors.

What's so different about continuous predictors?

When the predictors are continuous or they are categorical, but cannot be reduced to an orthogonal form, then the likelihood no longer factors into individual terms with a separate mean depending on a separate parameter. At this point, the error structure and link function do come into play.

If one cranks through the (tedious) algebra, the likelihood equations become $$ \sum_{i=1}^n \frac{(y_i - \mu_i)x_{ij}}{\sigma_i^2}\frac{\partial \mu_i}{\partial \lambda_i} = 0\>, $$ for all $j = 1,\ldots,p$ where $\lambda_i = \mathbf x_i^T \beta$. Here, the $\beta$ and $\nu$ parameters enter implicitly through the link relationship $\mu_i = g(\lambda_i) = g(\mathbf x_i^T \beta)$ and variance $\sigma_i^2$.

In this way, the link function and assumed error model become relevant to the estimation.

Example: The error model (almost) doesn't matter

In the example below, we generate negative binomial random data depending on three categorical factors. Each observation comes from a single category and the same dispersion parameter ($k = 6$) is used.

We then fit to these data using five different GLMs, each with a $\log$ link: (a) negative binomial, (b) Poisson, (c) Gaussian, (d) Inverse Gaussian and (e) Gamma GLMs. All of these are examples of exponential dispersion families.

From the table, we can see that the parameter estimates are identical, even though some of these GLMs are for discrete data and others are for continuous,and some are for nonnegative data while others are not.

      negbin  poisson gaussian invgauss    gamma
XX1 4.234107 4.234107 4.234107 4.234107 4.234107
XX2 4.790820 4.790820 4.790820 4.790820 4.790820
XX3 4.841033 4.841033 4.841033 4.841033 4.841033

The caveat in the heading comes from the fact that the fitting procedure will fail if the observations don't fall within the domain of the particular density. For example, if we had $0$ counts randomly generated in the data above, then the Gamma GLM would fail to converge since Gamma GLMs require strictly positive data.

Example: The link function (almost) doesn't matter

Using the same data, we repeat the procedure fitting the data with a Poisson GLM with three different link functions: (a) $\log$ link, (b) identity link and (c) square-root link. The table below shows the coefficient estimates after converting back to the log parameterization. (So, the second column showns $\log(\hat \beta)$ and the third shows $\log(\hat \beta^2)$ using the raw $\hat\beta$ from each of the fits). Again, the estimates are identical.

> coefs.po
         log       id     sqrt
XX1 4.234107 4.234107 4.234107
XX2 4.790820 4.790820 4.790820
XX3 4.841033 4.841033 4.841033

The caveat in the heading simply refers to the fact that the raw estimates will vary with the link function, but the implied mean-parameter estimates will not.

R code

# Warning! This code is a bit simplified for compactness.
library(MASS)
n <- 5
m <- 3
set.seed(17)
b <- exp(5+rnorm(m))
k <- 6

# Random negbin data; orthogonal factors
y <- rnbinom(m*n, size=k, mu=rep(b,each=n))
X <- factor(paste("X",rep(1:m,each=n),sep=""))

# Fit a bunch of GLMs with a log link
con <- glm.control(maxit=100)
mnb <- glm(y~X+0, family=negative.binomial(theta=2))
mpo <- glm(y~X+0, family="poisson")
mga <- glm(y~X+0, family=gaussian(link=log), start=rep(1,m), control=con)
miv <- glm(y~X+0, family=inverse.gaussian(link=log), start=rep(2,m), control=con)
mgm <- glm(y~X+0, family=Gamma(link=log), start=rep(1,m), control=con)    
coefs <- cbind(negbin=mnb$coef, poisson=mpo$coef, gaussian=mga$coef
                   invgauss=miv$coef, gamma=mgm$coef)

# Fit a bunch of Poisson GLMs with different links.
mpo.log  <- glm(y~X+0, family=poisson(link="log"))
mpo.id   <- glm(y~X+0, family=poisson(link="identity"))
mpo.sqrt <- glm(y~X+0, family=poisson(link="sqrt"))   
coefs.po <- cbind(log=mpo$coef, id=log(mpo.id$coef), sqrt=log(mpo.sqrt$coef^2))

Solved – Negative binomial regression in R allowing for correlation between dispersion & regression coefficients

I haven't found another R package which does this, but I have written code which, based on the maximum likelihood estimates of a model fitted with glm.nb, calculates the full variance covariance matrix using the observed information matrix.

Comparing to values from SAS this appears to match, but if anyone spots an error or finds that it does not match the variance covariance matrix from SAS or Stata, please add a comment to this answer.

glm.nb.cov <- function(mod) {
  #given a model fitted by glm.nb in MASS, this function returns a variance covariance matrix for the
  #regression coefficients and dispersion parameter, without assuming independence between these
  #note that the model must have been fitted with x=TRUE argument so that design matrix is available

  #formulae based on p23-p24 of http://pointer.esalq.usp.br/departamentos/lce/arquivos/aulas/2011/LCE5868/OverdispersionBook.pdf
  #and http://www.math.mcgill.ca/~dstephens/523/Papers/Lawless-1987-CJS.pdf

  k <- mod$theta
  #p is number of regression coefficients
  p <- dim(vcov(mod))[1]

  #construct observed information matrix
  obsInfo <- array(0, dim=c(p+1, p+1))

  #first calculate top left part for regression coefficients
  for (i in 1:p) {
    for (j in 1:p) {
      obsInfo[i,j] <- sum( (1+mod$y/mod$theta)*mod$fitted.values*mod$x[,i]*mod$x[,j] / (1+mod$fitted.values/mod$theta)^2  )
    }
  }

  #information for dispersion parameter
  obsInfo[(p+1),(p+1)] <- -sum(trigamma(mod$theta+mod$y) - trigamma(mod$theta) -
                                 1/(mod$fitted.values+mod$theta) + (mod$theta+mod$y)/(mod$theta+mod$fitted.values)^2 - 
                                 1/(mod$fitted.values+mod$theta) + 1/mod$theta)

  #covariance between regression coefficients and dispersion
  for (i in 1:p) {
    obsInfo[(p+1),i] <- -sum(((mod$y-mod$fitted.values) * mod$fitted.values / ( (mod$theta+mod$fitted.values)^2 )) * mod$x[,i] )
    obsInfo[i,(p+1)] <- obsInfo[(p+1),i]
  }

  #return variance covariance matrix
  solve(obsInfo)
}

Best Answer

Related Solutions

Regression – When Do Poisson and Negative Binomial Regressions Fit Same Coefficients?

Solved – Negative binomial regression in R allowing for correlation between dispersion & regression coefficients

Related Question