Solved – How is the intercept calculated in a generalized linear model and why is it different from a linear model

Tags: generalized-linear-model, negative-binomial-distribution, poisson-distribution, r

I have a set of count data to which I have fitted both a linear model and a Poisson generalized linear model. The mean of the raw data is 233.375 and the standard deviation is 279.983. I was surprised that a Poisson GLM fitted in R gives an intercept of 5.53865. A negative binomial GLM in R gives an intercept of 5.4526.

Is this because the data are sparse and/or over-dispersed?

The data and R code are:

library(MASS)  # provides glm.nb()

data <- c(2, 25, 1121, 361, 251, 123, 123, 81, 25, 215, 4, 196, 0, 353, 968, 336, 179, 229, 92, 204, 35, 299, 8, 371)
lm(data ~ 1)
glm(data ~ 1, family = poisson)
glm.nb(data ~ 1, link = log)

Best Answer

The difference in estimated intercepts is not because of the overdispersion in the data. Peter Flom's comment is the correct answer. To see this, change the lm() model into a glm() model with a gaussian family:

glm(data ~ 1, family = gaussian)
glm(data ~ 1, family = gaussian(link="log"),start=c(20))
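
As a quick check of what these fits estimate, the sketch below compares the intercepts with the sample mean directly. It re-enters the question's data under the hypothetical name `y`, and the start value `log(mean(y))` for the log-link gaussian fit is my own choice rather than the one used above.

```r
# Counts from the question, stored under a hypothetical name `y`
y <- c(2, 25, 1121, 361, 251, 123, 123, 81, 25, 215, 4, 196,
       0, 353, 968, 336, 179, 229, 92, 204, 35, 299, 8, 371)

# Identity link: the intercept is the sample mean itself
b_identity <- unname(coef(glm(y ~ 1, family = gaussian)))

# Log link: the intercept is the log of the sample mean.
# Starting values are required because y contains a zero;
# log(mean(y)) is an assumed, convenient choice.
b_log_gauss <- unname(coef(glm(y ~ 1, family = gaussian(link = "log"),
                               start = log(mean(y)))))
b_log_pois <- unname(coef(glm(y ~ 1, family = poisson)))

print(c(mean = mean(y), identity = b_identity))
print(c(log_mean = log(mean(y)), gaussian_log = b_log_gauss,
        poisson_log = b_log_pois))

# Back-transforming with the inverse link recovers the mean
print(exp(b_log_pois))
```

For an intercept-only model, both log-link fits should agree with `log(mean(y))` up to numerical tolerance.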

The canonical link for the gaussian family is the identity link, so you get exactly the same estimate as from lm(). Changing to the log link gives you the same intercept estimate that you're getting from the Poisson and NB models. The gaussian model with log link is $\log(E(Y|X)) = \theta' X$, while the GLM with identity link is $E(Y|X) = \theta' X$. That's why exponentiating the estimated intercept of the log link models, $e^{5.453} \approx 233$, gives you the estimated intercept of the identity link models: you are applying the inverse link function.

Getting the expected number of saplings per plot for this simple model is easy from the coefficient alone, but once you add treatment effects and other covariates it becomes more difficult. You should use the predict() function like this:

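library(MASS)  # provides glm.nb()

# Add the design factors: 2 treatments x 2 years, 6 observations per cell
data = data.frame(saplings=data,
                  treat=gl(2,6,24,labels=c("control","treat")),
                  year=gl(2,12,labels=c("2004","2011")))

test.glm = glm.nb(saplings~treat*year,link=log,data=data)
# One row per treatment-by-year combination
nd = data.frame(treat=gl(2,1,4,labels=c("control","treat")),
            year=gl(2,2,labels=c("2004","2011")))
# type="response" applies the inverse link and returns expected counts
predict(test.glm,newdata=nd,type="response")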
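
To see what `type="response"` does, the sketch below compares it with back-transforming the link-scale predictions by hand. It is an illustration under assumptions: it uses a Poisson fit instead of glm.nb() so that only base R is needed, and the names `counts`, `d`, `fit`, and `grid` are hypothetical.

```r
# Counts from the question, stored under a hypothetical name
counts <- c(2, 25, 1121, 361, 251, 123, 123, 81, 25, 215, 4, 196,
            0, 353, 968, 336, 179, 229, 92, 204, 35, 299, 8, 371)
d <- data.frame(saplings = counts,
                treat = gl(2, 6, 24, labels = c("control", "treat")),
                year  = gl(2, 12, labels = c("2004", "2011")))

# Poisson stand-in for the NB model (an assumption of this sketch)
fit <- glm(saplings ~ treat * year, family = poisson, data = d)

# All treatment-by-year combinations, treat varying fastest
grid <- expand.grid(treat = levels(d$treat), year = levels(d$year))

p_link <- predict(fit, newdata = grid, type = "link")      # log scale
p_resp <- predict(fit, newdata = grid, type = "response")  # count scale

# Observed mean count in each treatment-by-year cell
cell_means <- with(d, tapply(saplings, list(treat, year), mean))

print(cbind(grid, link = p_link, response = p_resp, by_hand = exp(p_link)))
print(cell_means)
```

Because `treat * year` is saturated for these two factors, the response-scale predictions should coincide with the observed cell means. With additional covariates that shortcut no longer applies, which is where predict() earns its keep.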

Also see this question, and read Chapter 6 of Zuur et al. (2007), "Analysing Ecological Data".

Related Solutions

Solved – Fitting a Generalized Linear Model (GLM) in R

There are three components to the GLM: an outcome variable, a linear predictor and a link function. The link function in the GLM relates the expected value of the outcome variable to the linear predictor. In other words, not the expected value itself, but a function of it is modeled by the linear predictor. An example with the logarithm as the link function and the linear predictor $\beta_0 + \beta_1*x$ is:

$$\log(E(y)) = \beta_0 + \beta_1*x$$
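
The role of the link can be seen directly on R's family objects, which store the link and its inverse as the components `linkfun` and `linkinv`. The sketch below is only an illustration: the coefficient values `b0` and `b1` and the covariate value `x` are made up.

```r
fam <- poisson(link = "log")

# Made-up coefficients and covariate value, for illustration only
b0 <- 1.5
b1 <- 0.2
x  <- 4

eta <- b0 + b1 * x        # linear predictor
mu  <- fam$linkinv(eta)   # expected value E(y), here exp(eta)

print(mu)
print(fam$linkfun(mu))    # applying the link to E(y) returns eta
```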

In your case, the linear predictor is $\log(\beta_0) + \beta_1*\log({\rm exp}_1) + \beta_2*\log({\rm exp}_2)$. So the equation for your model becomes:

$$\log(E(y)) = \log(\beta_0) + \beta_1*\log({\rm exp}_1) + \beta_2*\log({\rm exp}_2)$$

I think this is a bit weird and I would argue that possibly that's not the model you are supposed to fit. Anyway, to fit this model with R, the code should look like this:

model <- glm(formula = Y ~ log(exp1) + log(exp2), family = poisson(link="log"), 
             data = CSV_table)

The only thing to take care of after running the model is the intercept: R reports the estimate of $\log(\beta_0)$, so take the exponential of the estimated intercept if you want $\beta_0$ itself. A good book if you want to learn about the GLM and categorical data analysis in general is the one by Agresti (2007).
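
A small simulation illustrates the back-transformation of the intercept. Everything in this sketch is hypothetical: the data frame `sim`, the sample size, and the true values `beta0`, `beta1`, and `beta2` are made up, and only the model formula follows the call above.

```r
set.seed(1)
n <- 2000
beta0 <- 3; beta1 <- 0.5; beta2 <- 1.2   # assumed true values

sim <- data.frame(exp1 = runif(n, 1, 10), exp2 = runif(n, 1, 10))
# E(Y) = beta0 * exp1^beta1 * exp2^beta2, which is linear on the log scale
sim$Y <- rpois(n, beta0 * sim$exp1^beta1 * sim$exp2^beta2)

fit <- glm(Y ~ log(exp1) + log(exp2), family = poisson(link = "log"),
           data = sim)

# The reported intercept estimates log(beta0); exponentiate to get beta0
log_beta0_hat <- unname(coef(fit)["(Intercept)"])
beta0_hat <- exp(log_beta0_hat)
print(c(log_beta0_hat = log_beta0_hat, beta0_hat = beta0_hat))

# Equivalent view: the expected count when exp1 = exp2 = 1
at_one <- predict(fit, newdata = data.frame(exp1 = 1, exp2 = 1),
                  type = "response")
print(unname(at_one))
```

The exponentiated intercept is the expected count when both covariates equal 1, since their logs are then zero.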

References:

Agresti, A. (1996). An introduction to categorical data analysis (Vol. 135). New York: Wiley.

Solved – hurdle model with negative binomial distribution of counts – error message and model selection

(1) It is hard to tell what exactly goes wrong here without a reproducible example. One possibility could be that there are no non-zero observations for certain combinations of factor levels. Another possibility could be that the theta estimate in the NB version of the model degenerates either towards zero or towards infinity and hence leads to numeric problems. It could also be something else, though...

(2) I wouldn't start testing the model that generated the warning before I figured out what went wrong in (1). I wouldn't just ignore the warning.

(3) I would recommend constructing the plots by hand rather than estimating a poorly fitting GLM. You can get fitted() and residuals() from the hurdle model and then call plot functions for scatter and QQ plots respectively.
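
The first possibility mentioned in (1), factor-level combinations without any non-zero observations, can be checked before fitting. The sketch below uses a made-up data frame `d` with hypothetical factors `site` and `season`; substitute the factors from your own model.

```r
# Made-up example data: 2 sites x 2 seasons, 2 observations per cell
d <- data.frame(
  site   = factor(rep(c("A", "B"), each = 4)),
  season = factor(rep(rep(c("dry", "wet"), each = 2), times = 2)),
  count  = c(0, 0, 3, 0, 5, 2, 0, 7)
)

# Number of non-zero counts in each combination of factor levels
nonzero <- with(d, tapply(count > 0, list(site = site, season = season), sum))
print(nonzero)

# Combinations in which the count part of a hurdle model has no data
empty <- which(nonzero == 0, arr.ind = TRUE)
print(empty)
```

Any combination listed in `empty` contributes only zeros, so the truncated count part of the model has nothing to estimate from there.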
