Poisson Regression – Handling Strong Patterns in Residuals for Poisson Regression

generalized linear modelpoisson-regressionresiduals

I am running this Poisson regression, and am facing a very strong pattern in deviance and pearson residuals, what would be an appropriate approach to rectify the model?

enter image description here

enter image description here

Here is the head of my data:

> head(bob_poisson_aggreg)
  ticketCount artistVotes capacity ticketsRemain
1         120        1168      169            49
2          21        4365      379           358
3         153        3710     2352          2199
4         158        8766      615           457
5          25         622       50            25
6         314        7700      700           386
  bob_poisson_mean_aggreg.artistRating
1                                 4.57
2                                 4.67
3                                 4.90
4                                 4.49
5                                 4.38
6                                 4.42

Here is the model I ran:

mod_poi_1 <- glm(ticketCount ~. , family = poisson , data = bob_poisson_aggreg)
summary(mod_poi_1)
Call:
glm(formula = ticketCount ~ ., family = poisson, data = bob_poisson_aggreg)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-10.5927   -2.5578    0.1436    2.0250    7.7396  

Coefficients:
                                       Estimate Std. Error  z value
(Intercept)                           2.699e+00  1.260e-01   21.418
artistVotes                          -1.124e-05  4.252e-07  -26.435
capacity                              8.464e-03  7.584e-05  111.604
ticketsRemain                        -8.449e-03  8.109e-05 -104.188
bob_poisson_mean_aggreg.artistRating  1.914e-01  2.823e-02    6.781
                                     Pr(>|z|)    
(Intercept)                           < 2e-16 ***
artistVotes                           < 2e-16 ***
capacity                              < 2e-16 ***
ticketsRemain                         < 2e-16 ***
bob_poisson_mean_aggreg.artistRating 1.19e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18688.0  on 274  degrees of freedom
Residual deviance:  3388.1  on 270  degrees of freedom
AIC: 4972

Number of Fisher Scoring iterations: 4

I am not even sure that a poisson model would be appropriate here, but if it were, how should I move on? Would I do a transformation on the response variable : "ticketCount".

Is there some general procedure to follow?

I would much appreciate any insight or references!

Best Answer

I would not begin by transforming the response variable (DV).

I'd start by considering whether you have the right link function or whether you should transform some x's (independent variables).

If you expect the ticketCount to be proportional to some of those predictors (I sure would), you might either want to use an identity link or enter the logs of some relevant predictors, possibly putting them in as offsets; the choice would depend on whether you see the way that the IVs would relate to the response as additive or multiplicative on the untransformed ticketCount scale.

There's other things you could consider, but careful consideration of the relationship between the DV and IVs is central to choosing a good model.

Here's an example of simulated Poisson data where the simulation used an identity link (i.e. $E(Y)$ is linear in $x$ -- in fact proportional to it in this case) while the fit used the default log-link:

Plot showing curved effect of wrong link function on residual vs fitted plot

That's reasonably consistent with what you see. As soon as I saw your plot my first thought was "maybe an identity link was needed" and then after looking at your variable names that seemed like it might make a lot of sense; I'd have probably tried the identity link first.

Here's what I get when I use my two proposed solutions:

Residual plots when using identity link and using log(x) as predictor

You can see that both of them solved the lack of fit problem, but there's some heteroskedasticity in the case where I fitted a different link to what I used to generate the data (this is as expected).


# example code included as requested
#(assumes we have already used `par` to get 1x2 grid for plots like above)

# generate data
x=runif(1000,1,20)
y=rpois(1000,33*x)

#fit (incorrect in this case) log-link function
pfit=glm(y~x,family=poisson)

# first plot
plot(x,y)
plot(pfit,which=1) #shows bowed appearance 

# identity link (correct in this case)
#  generalizes to additive in predictors
pfiti=glm(y~x,family=poisson(link="identity"))

# log link, log predictor, suits *multiplicative* model
#  and general case is in powers - could fit a model like
#   E(Y) = a . X1 . X2 . X3^b3 . X4^b4 using offsets
pfitl=glm(y~log(x),family=poisson)

# second plot
plot(pfiti,which=1)
plot(pfitl,which=1)
Related Question