Solved – How to analyze residuals of Poisson log-linear model

count-data, log-linear, poisson-distribution, regression, residuals

I have bird count data and use a classical Poisson log-linear model, i.e. we have counts obs(i,j) – the observed count for site i and year j – and the model is:

ln(model(i,j)) = site_effect(i) + year_effect(j)

where model(i,j) is the expected count for site i and year j, and the counts are assumed to be Poisson distributed, i.e. obs(i,j) ~ Poiss(lambda = model(i,j)). Serial correlation and overdispersion were taken into account.

Now I need to analyse the residuals of this model as the dependent variable in another regression with some explanatory variables. (Please don't tell me I should put these variables in the above model – to be brief, there are technical issues preventing that.)

The problem now is:
1) how to define the residuals,
2) what their distribution is,
3) which regression to use to explain them.

Possible solutions of residual definition and analysis:
A) obs/model, with another Poisson log-linear regression
B) log((obs+1)/model), normal distribution -> normal linear regression
C) log((obs+1)/(model+1)), normal distribution -> normal linear regression
D) … any other??

Ad Solution A: use obs/model, i.e. divide the observed value by the expected value from the model (for each i,j). But what is the distribution of these ratios? I would say Poisson, but the numbers are not whole numbers! So I multiplied by 20 and rounded to whole numbers to be able to fit a Poisson. Is this a good idea, or is it nonsense? Or how should I fit a Poisson to a real-valued, continuous random variable? Here is the distribution:

[Figure: distribution of 20*obs/model, rounded to whole numbers, with Poisson fit]

Note that the Poisson distribution doesn't fit (using the goodfit() test in R).
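For clarity, this is roughly what I did in R (obs and model_fit stand for the vectors of observed and fitted counts):

    library(vcd)                             # provides goodfit()

    ratio  <- obs / model_fit                # residuals defined as obs/model
    scaled <- round(20 * ratio)              # multiply by 20 and round to integers

    gf <- goodfit(scaled, type = "poisson")  # fit a Poisson to the rounded values
    summary(gf)                              # goodness-of-fit test (rejects here)
    plot(gf)                                 # plot of observed vs fitted counts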

So there is the problem that obs/model are real numbers – not natural numbers, as the Poisson distribution requires. Moreover, I'd like to avoid another Poisson log-linear regression if possible, as I'm quite unfamiliar with how to analyse the explanatory variables, the explained/unexplained variability etc. there, in contrast to normal regression.

Ad Solution B/C: use something like log((obs+1)/model) or log((obs+1)/(model+1)). There is a problem with zero observed counts (there are no zero model counts), so I can't simply use log(obs/model) as I would like to. How can I solve this problem? And is it better to use solution B or C?
Here are the distributions.

[Figure: distribution for Solution B]

[Figure: distribution for Solution C]

Note that in this case, too, the normal distributions don't fit. Nothing really fits, but that is quite normal with real count data. 🙂
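In R terms, solutions B and C would look roughly like this (obs and model_fit as above; x1 and x2 are placeholders for my explanatory variables):

    res_B <- log((obs + 1) / model_fit)         # solution B
    res_C <- log((obs + 1) / (model_fit + 1))   # solution C

    shapiro.test(res_B)                         # normality check for solution B
    shapiro.test(res_C)                         # normality check for solution C

    # if one of them were acceptable, the follow-up would be an ordinary linear model:
    # lm(res_B ~ x1 + x2, data = covariates)    # x1, x2 are placeholder covariates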

So, to summarise:
– which solution do you think fits best?
– how can the problems of each solution be overcome?
– is it a big mistake to use B/C and go for a normal linear regression if I'd like to avoid the Poisson? And should I use B or C?

Thanks a lot!

Best Answer

This sounds like a hierarchical model (or multi-level model), just not explained in those terms. Firstly, note that a Poisson "residual" is not really the same thing as you would find in the case of a normal error term. The reason is that a Poisson random variable cannot easily be separated into a "systematic part" and a "random part". One standard approach is to use the "deviance residuals", defined by:

$$d_{i}=\operatorname{sgn}(Y_{i}-\hat{Y}_{i})\sqrt{2\left[Y_{i}\log\left(\frac{Y_{i}}{\hat{Y}_{i}}\right)-(Y_{i}-\hat{Y}_{i})\right]}$$

These quantities are analogues of the residuals in OLS regression and can easily be obtained with residuals(glm.object, type = "deviance") in R. You can show that the sum of the squared deviance residuals has an approximate chi-square distribution, but I wouldn't know how to work out the exact distribution.
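As a minimal sketch (assuming a data frame birds with columns count, site and year matching the model in the question), the deviance residuals could be extracted like this:

    # Poisson log-linear model with site and year effects
    fit <- glm(count ~ factor(site) + factor(year),
               family = poisson(link = "log"),
               data = birds)

    # deviance residuals as defined above
    dev_res <- residuals(fit, type = "deviance")

    # these could then serve as the response in the follow-up regression, e.g.
    # lm(dev_res ~ x1 + x2, data = covars)   # x1, x2 are placeholder covariates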

An alternative is to add further structure to the Poisson GLM by introducing "random effects" into the model equation. For the Poisson model with the canonical log-link function we have $\log(\mu_i)=x_{i}^{T}\beta$. In your case $x_{i}^{T}$ is a row of a "design matrix" (all zeros and ones) which picks out the relevant beta coefficients for your site effect and year effect. Adding random effects basically says that the above relationship only holds "on average" rather than deterministically (you can think of this as a "softer" link function), so that we actually have $E\left[\log(\mu_i)\right]=x_{i}^{T}\beta$. The "actual" link is given by:

$$\log(\mu_i)=x_{i}^{T}\beta+z_{i}^{T}u_{i}$$

where $z_{i}^{T}$ is a known set of covariates and $u_i$ is a random effect for the $i$th observation with $E[u_{i}]=0$. Now if you set $z_{i}^{T}=1$ then you have a good definition of the residuals, given by the random effects $u_i$. All that remains is to assign a probability distribution to the random effects and do the maths/coding work to obtain either point estimates or posterior distributions for $u_{i}$ to use in your subsequent model. These models are often called generalised linear mixed models, and $u_i$ is usually chosen to have a normal distribution.

The lme4 package (glmer()), or glmmPQL() from MASS (which builds on nlme), should be able to do this kind of modelling in R.
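A rough sketch with lme4 (again assuming a data frame birds with count, site and year): an observation-level random intercept plays the role of $u_i$, and its conditional modes can be pulled out for the follow-up regression.

    library(lme4)

    # one random-effect level per observation, so u_i is an observation-level intercept
    birds$obs_id <- factor(seq_len(nrow(birds)))

    glmm <- glmer(count ~ factor(site) + factor(year) + (1 | obs_id),
                  family = poisson(link = "log"),
                  data = birds)

    # point estimates (conditional modes) of the random effects u_i
    u_hat <- ranef(glmm)$obs_id[, "(Intercept)"]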
