Poisson Regression – Handling Strong Patterns in Residuals for Poisson Regression

generalized linear modelpoisson-regressionresiduals

I am running this Poisson regression, and am facing a very strong pattern in deviance and pearson residuals, what would be an appropriate approach to rectify the model?

Here is the head of my data:

> head(bob_poisson_aggreg)
  ticketCount artistVotes capacity ticketsRemain
1         120        1168      169            49
2          21        4365      379           358
3         153        3710     2352          2199
4         158        8766      615           457
5          25         622       50            25
6         314        7700      700           386
  bob_poisson_mean_aggreg.artistRating
1                                 4.57
2                                 4.67
3                                 4.90
4                                 4.49
5                                 4.38
6                                 4.42

Here is the model I ran:

mod_poi_1 <- glm(ticketCount ~. , family = poisson , data = bob_poisson_aggreg)
summary(mod_poi_1)
Call:
glm(formula = ticketCount ~ ., family = poisson, data = bob_poisson_aggreg)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-10.5927   -2.5578    0.1436    2.0250    7.7396  

Coefficients:
                                       Estimate Std. Error  z value
(Intercept)                           2.699e+00  1.260e-01   21.418
artistVotes                          -1.124e-05  4.252e-07  -26.435
capacity                              8.464e-03  7.584e-05  111.604
ticketsRemain                        -8.449e-03  8.109e-05 -104.188
bob_poisson_mean_aggreg.artistRating  1.914e-01  2.823e-02    6.781
                                     Pr(>|z|)    
(Intercept)                           < 2e-16 ***
artistVotes                           < 2e-16 ***
capacity                              < 2e-16 ***
ticketsRemain                         < 2e-16 ***
bob_poisson_mean_aggreg.artistRating 1.19e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18688.0  on 274  degrees of freedom
Residual deviance:  3388.1  on 270  degrees of freedom
AIC: 4972

Number of Fisher Scoring iterations: 4

I am not even sure that a poisson model would be appropriate here, but if it were, how should I move on? Would I do a transformation on the response variable : "ticketCount".

Is there some general procedure to follow?

I would much appreciate any insight or references!

Best Answer

I would not begin by transforming the response variable (DV).

I'd start by considering whether you have the right link function or whether you should transform some x's (independent variables).

If you expect the ticketCount to be proportional to some of those predictors (I sure would), you might either want to use an identity link or enter the logs of some relevant predictors, possibly putting them in as offsets; the choice would depend on whether you see the way that the IVs would relate to the response as additive or multiplicative on the untransformed ticketCount scale.

There's other things you could consider, but careful consideration of the relationship between the DV and IVs is central to choosing a good model.

Here's an example of simulated Poisson data where the simulation used an identity link (i.e. $E(Y)$ is linear in $x$ -- in fact proportional to it in this case) while the fit used the default log-link:

That's reasonably consistent with what you see. As soon as I saw your plot my first thought was "maybe an identity link was needed" and then after looking at your variable names that seemed like it might make a lot of sense; I'd have probably tried the identity link first.

Here's what I get when I use my two proposed solutions:

You can see that both of them solved the lack of fit problem, but there's some heteroskedasticity in the case where I fitted a different link to what I used to generate the data (this is as expected).

# example code included as requested
#(assumes we have already used `par` to get 1x2 grid for plots like above)

# generate data
x=runif(1000,1,20)
y=rpois(1000,33*x)

#fit (incorrect in this case) log-link function
pfit=glm(y~x,family=poisson)

# first plot
plot(x,y)
plot(pfit,which=1) #shows bowed appearance 

# identity link (correct in this case)
#  generalizes to additive in predictors
pfiti=glm(y~x,family=poisson(link="identity"))

# log link, log predictor, suits *multiplicative* model
#  and general case is in powers - could fit a model like
#   E(Y) = a . X1 . X2 . X3^b3 . X4^b4 using offsets
pfitl=glm(y~log(x),family=poisson)

# second plot
plot(pfiti,which=1)
plot(pfitl,which=1)

Related Solutions

Poisson Regression – Residuals in Poisson Regression

The distinction is clear as soon as you understand what a Pearson residual is.

You are correct that for a Poisson model, variance increases as mean increases.

As a result, ordinary raw residuals ($r_i=y_i-\hat\mu_i$) should have a spread that increases with fitted values (though not in proportion).

However, Pearson residuals are residuals divided by the square root of the variance according to the model ($r^P_i=\frac{y_i-\hat\mu_i}{\sqrt{\hat\mu_i}}$ for a Poisson model). This means that if the model is correct, the Pearson residuals should have constant spread.

Residual plots from a simple simulated Poisson regression model. Left plot: raw residuals vs fitted mean show increasing spread with mean. There is diagonal "banding" in the residuals because the data are discrete. Right plot: Pearson residuals show what looks like constant spread as mean changes, and the diagonal bands are now curved.

Solved – Poisson regression residuals diagnostic

As explained here, neither deviance nor Pearson residuals are ideal for diagnosing Poisson models, as they will appear visually inhomogeneous for low count rates, even if the model is entirely correct.

Instead, you can use the DHARMa package, which implements the idea of randomized quantile residuals by Dunn and Smyth (1996). Here an example with a missing quadratic predictor in the glm. Note the comments about testing for overdispersion in the vignette.

library(DHARMa)

dat = createData(replicates = 1, sampleSize = 300, intercept = 0,
           fixedEffects = 1, quadraticFixedEffects = 2, 
           randomEffectVariance = 0, family = poisson())

fit = glm(observedResponse ~ Environment1 , data = dat, family = poisson)
res = simulateResiduals(fit)
plot(res)

Best Answer

Related Solutions

Poisson Regression – Residuals in Poisson Regression

Solved – Poisson regression residuals diagnostic

Related Question