Solved – How does glmnet handle overdispersion

Tags: glmnet, lasso, overdispersion, poisson distribution, regularization

I have a question about how to model count data derived from text, in particular how I could use the lasso technique to reduce the number of features.

Say I have N online articles and the pageview count for each article. I've extracted 1-grams and 2-grams from each article and I want to regress the pageview counts on those 1,2-grams. Since the features (1,2-grams) far outnumber the observations, the lasso would be a nice method for reducing the number of features. Also, I've found glmnet really handy for running lasso analyses.

However, the pageview counts are overdispersed (variance > mean), and for count data glmnet doesn't offer quasipoisson (explicitly) or negative binomial, only poisson. The solution I've thought of is to log-transform the counts (a commonly used method among social scientists) so that the response variable roughly follows a normal distribution. I could then model the data with the gaussian family in glmnet.
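What I have in mind is something like the following sketch, where x and y are placeholders for my 1,2-gram feature matrix and pageview counts:

library(glmnet)

# x: sparse matrix of 1,2-gram features (N articles x p n-grams), y: pageview counts.
# log1p = log(1 + y) keeps articles with zero pageviews in the data.
y_log <- log1p(y)
cv_gauss <- cv.glmnet(x, y_log, family = "gaussian")
coef(cv_gauss, s = "lambda.min")  # features selected at the CV-optimal penalty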

So my question is: is it appropriate to do so? Or shall I just use the poisson family in glmnet, in case glmnet handles quasipoisson internally? Or are there other R packages that handle this situation?

Thank you very much!

Best Answer

Short answer

Overdispersion does not matter when estimating the vector of regression coefficients for the conditional mean in a quasi-/poisson model! You will be fine if you forget about the overdispersion here, use glmnet with the poisson family, and just focus on whether your cross-validated prediction error is low.
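In glmnet terms, that advice amounts to something like this sketch (x and y stand in for your feature matrix and raw counts):

library(glmnet)

# x: feature matrix, y: raw pageview counts (placeholders for your data)
cv_pois <- cv.glmnet(x, y, family = "poisson", type.measure = "deviance")
plot(cv_pois)                    # cross-validated deviance along the lambda path
coef(cv_pois, s = "lambda.1se")  # sparse coefficients at a conservative lambda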

The qualification follows below.


Poisson, Quasi-Poisson and estimating functions:

I say the above because overdispersion (OD) in a poisson or quasi-poisson model affects everything to do with the dispersion (or variance or scale or heterogeneity or spread or whatever you want to call it) and as such has an effect on the standard errors and confidence intervals, but it leaves the estimates for the conditional mean of $y$ (called $\mu$) untouched. This applies in particular to linear decompositions of the mean, like $x^\top\beta$.

This comes from the fact that the estimating equations for the coefficients of the conditional mean are practically the same for the poisson and quasi-poisson models. Quasi-poisson specifies the variance function in terms of the mean and an additional parameter (say $\theta$) as $Var(y)=\theta\mu$ (with $\theta=1$ for Poisson), but $\theta$ turns out to be irrelevant when solving the estimating equations. Thus $\theta$ plays no role in estimating $\beta$ whenever the conditional mean and variance are proportional. Therefore the point estimates $\hat{\beta}$ are identical for the quasi-poisson and poisson models!
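To see why $\theta$ drops out, write the quasi-score equations for $\beta$ with variance function $\theta\mu$:

$\sum_{i=1}^{n}\frac{y_i-\mu_i}{\theta\mu_i}\frac{\partial\mu_i}{\partial\beta}=0$

The constant $\theta$ multiplies every term, so it cancels from the equation, leaving exactly the Poisson score equations and hence the same solution $\hat{\beta}$.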

Let me illustrate with an example:

> library(MASS)
> data(quine) 
> modp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="poisson")
> modqp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="quasipoisson")
> summary(modp)

Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "poisson", 
    data = quine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.808  -3.065  -1.119   1.819   9.909  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.71538    0.06468  41.980  < 2e-16 ***
AgeF1       -0.33390    0.07009  -4.764 1.90e-06 ***
AgeF2        0.25783    0.06242   4.131 3.62e-05 ***
AgeF3        0.42769    0.06769   6.319 2.64e-10 ***
SexM         0.16160    0.04253   3.799 0.000145 ***
EthN        -0.53360    0.04188 -12.740  < 2e-16 ***
LrnSL        0.34894    0.05204   6.705 2.02e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: 2299.2

Number of Fisher Scoring iterations: 5

> summary(modqp)

Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "quasipoisson", 
    data = quine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.808  -3.065  -1.119   1.819   9.909  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.7154     0.2347  11.569  < 2e-16 ***
AgeF1        -0.3339     0.2543  -1.313 0.191413    
AgeF2         0.2578     0.2265   1.138 0.256938    
AgeF3         0.4277     0.2456   1.741 0.083831 .  
SexM          0.1616     0.1543   1.047 0.296914    
EthN         -0.5336     0.1520  -3.511 0.000602 ***
LrnSL         0.3489     0.1888   1.848 0.066760 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 13.16691)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

As you can see, even though we have strong overdispersion of 12.21 in this data set (computed as deviance(modp)/modp$df.residual), the regression coefficients (point estimates) do not change at all. But notice how much the standard errors change.
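Both claims are easy to verify in the same session; the quasipoisson standard errors are just the poisson ones inflated by the square root of the estimated (Pearson) dispersion, $\sqrt{13.16691}\approx 3.63$:

# The point estimates coincide exactly:
all.equal(coef(modp), coef(modqp))
# Each quasipoisson standard error is the poisson one times sqrt(13.16691):
summary(modqp)$coefficients[, "Std. Error"] / summary(modp)$coefficients[, "Std. Error"]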

The question of the effect of overdispersion in penalized poisson models

Penalized models are mostly used for prediction and variable selection, not (yet) for inference. So people who use these models are interested in the regression parameters for the conditional mean, just shrunk towards zero. If the penalization is the same, the estimating equations for the conditional mean derived from the penalized (quasi-)likelihood also do not depend on $\theta$, and therefore overdispersion does not matter for the estimates of $\beta$ in a model of the type:

$g(\mu)=x^\top\beta + f(\beta)$

since $\beta$ is estimated the same way for any variance function of the form $\theta \mu$, that is, for all models where the conditional mean and variance are proportional. This is just like in the unpenalized poisson/quasipoisson models.

If you don't want to take this at face value and would rather avoid the math, you can find empirical support in the fact that in glmnet, if you drive the regularization parameter towards 0 (so that $f(\beta)=0$), you end up pretty much where the poisson and quasipoisson models land (see the last column below, where lambda is 0.005):

> library(glmnet)
> y <- quine[,5]
> x <- model.matrix(~Age+Sex+Eth+Lrn,quine)
> modl <- glmnet(y=y,x=x, lambda=c(0.05,0.02,0.01,0.005), family="poisson")
> coefficients(modl)
8 x 4 sparse Matrix of class "dgCMatrix"
                    s0         s1         s2         s3
(Intercept)  2.7320435  2.7221245  2.7188884  2.7172098
(Intercept)  .          .          .          .        
AgeF1       -0.3325689 -0.3335226 -0.3339580 -0.3340520
AgeF2        0.2496120  0.2544253  0.2559408  0.2567880
AgeF3        0.4079635  0.4197509  0.4236024  0.4255759
SexM         0.1530040  0.1581563  0.1598595  0.1607162
EthN        -0.5275619 -0.5311830 -0.5323936 -0.5329969
LrnSL        0.3336885  0.3428815  0.3459650  0.3474745
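To make the comparison explicit, you can line up the unpenalized fit against the lambda = 0.005 column (a quick check; row 2 is the duplicated intercept that appears because model.matrix already contains an intercept column, so it is dropped):

# Side-by-side comparison of glm and the weakest-penalty glmnet fit:
round(cbind(glm = coef(modp), glmnet = as.matrix(coefficients(modl))[-2, "s3"]), 4)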

So what does OD do to penalized regression models? As you may know, there is still some debate about the proper way to calculate standard errors for penalized models, and glmnet does not output any, probably for that reason. It may very well be that OD would influence the inference part of the model, just as it does in the non-penalized case, but until some consensus regarding inference in penalized models is reached, we won't know.

As an aside, one can leave all this messiness behind by adopting a Bayesian view, in which penalized models are just standard models with a specific prior.
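For example (a sketch only, assuming the rstanarm package is available; the lasso penalty corresponds to a Laplace-type shrinkage prior on the coefficients, and a negative binomial likelihood models the overdispersion directly instead of ignoring it):

library(rstanarm)

# Bayesian analogue of a penalized count model on the quine data:
# lasso() places a Laplace-type shrinkage prior on the coefficients, and
# neg_binomial_2() absorbs the overdispersion that plain poisson ignores.
modb <- stan_glm(Days ~ Age + Sex + Eth + Lrn, data = quine,
                 family = neg_binomial_2(), prior = lasso())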