Solved – How is the intercept calculated in a generalized linear model and why is it different from a linear model

generalized linear modelnegative-binomial-distributionpoisson distributionr

I have a set of count data that I have fitted both a linear model to and a Poisson generalized linear model. The mean of the raw data is 233.375 and the standard deviation 279.983. I have been surprised when I fit a Poisson glm in R that the intercept = 5.53865. A negative bionomial glm in R gives an intercept of 5.4526.

Is this because the data are sparse and/or over-dispersed?

The data and R code are:

data <-c(2, 25,1121,361,251,123,123,81,25,215,4,196,0,353,968,336,179,229,92,204,35,299,8,371)
lm(data ~ 1)
glm(data ~ 1, family = poisson)
glm.nb(data ~1, link = log)

Best Answer

The difference in estimated intercepts is not because of the overdispersion in the data. Peter Flom's comment is the correct answer. To see this, change the lm() model into a glm() model with a gaussian family:

glm(data ~ 1, family = gaussian)
glm(data ~ 1, family = gaussian(link="log"),start=c(20))

The canonical link for the gaussian family is the identity link, so you get exactly the same estimate as for lm(). Changing the link to the log link function gives you the same estimate of the intercept that you're getting from the Poisson and NB models. The gaussian model with log link is $log(E(Y|X))=θ^′X$, while the glm with identity link is $E(Y|X)=θ^′X$. That's why exponentiating the estimated intercept for the log link models $e^{5.453} = 233$ gives you the estimated intercept for the identity link models - you are using the inverse link function. Getting the expected # of saplings per plot for this simple model is easy with just the value of the coefficient, but once you add treatment effects and other covariates it will be more difficult. You should use the predict() function like this:

data = data.frame(saplings=data,
                  treat=gl(2,6,24,label=c("control","treat")),
                  year=gl(2,12,label=c("2004","2011")))

test.glm = glm.nb(saplings~treat*year,link=log,data=data)
nd = data.frame(treat=gl(2,1,4,label=c("control","treat")),
            year=gl(2,2,label=c("2004","2011")))
predict(test.glm,newdata=nd,type="response")

Also see this question, and read Chapter 6 of Zuur et al (2007) "Analysing Ecological Data"