Solved – How does glmnet handle overdispersion

Tags: glmnet, lasso, overdispersion, poisson distribution, regularization

I have a question about how to model count data derived from text, in particular how I could use the lasso technique to reduce the number of features.

Say I have N online articles and the pageview count for each article. I've extracted 1-grams and 2-grams from each article and I want to regress the pageview counts on those 1,2-grams. Since the features (1,2-grams) far outnumber the observations, the lasso would be a nice method for reducing the number of features. Also, I've found glmnet really handy for running lasso analyses.

However, the pageview counts are overdispersed (variance > mean), and for count data glmnet doesn't offer quasipoisson (explicitly) or negative binomial, only poisson. The solution I've thought of is to log-transform the counts (a commonly used method among social scientists) so that the response variable roughly follows a normal distribution. I could then model the data with the gaussian family in glmnet.
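What I have in mind is something like the following sketch, where x and y are placeholders for my 1,2-gram feature matrix and pageview counts:

library(glmnet)

# x: sparse matrix of 1,2-gram features (N articles x p n-grams), y: pageview counts.
# log1p = log(1 + y) keeps articles with zero pageviews in the data.
y_log <- log1p(y)
cv_gauss <- cv.glmnet(x, y_log, family = "gaussian")
coef(cv_gauss, s = "lambda.min")  # features selected at the CV-optimal penalty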

So my question is: is it appropriate to do so? Or shall I just use the poisson family in glmnet, in case glmnet handles quasipoisson internally? Or are there other R packages that handle this situation?

Thank you very much!

Best Answer

Short answer

Overdispersion does not matter when estimating the vector of regression coefficients for the conditional mean in a quasi-/poisson model! You will be fine if you forget about the overdispersion here, use glmnet with the poisson family, and just focus on whether your cross-validated prediction error is low.
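In glmnet terms, that advice amounts to something like this sketch (x and y stand in for your feature matrix and raw counts):

library(glmnet)

# x: feature matrix, y: raw pageview counts (placeholders for your data)
cv_pois <- cv.glmnet(x, y, family = "poisson", type.measure = "deviance")
plot(cv_pois)                    # cross-validated deviance along the lambda path
coef(cv_pois, s = "lambda.1se")  # sparse coefficients at a conservative lambda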

The qualification follows below.


Poisson, Quasi-Poisson and estimating functions:

I say the above because overdispersion (OD) in a poisson or quasi-poisson model affects everything to do with the dispersion (or variance or scale or heterogeneity or spread or whatever you want to call it) and as such has an effect on the standard errors and confidence intervals, but it leaves the estimates for the conditional mean of $y$ (called $\mu$) untouched. This applies in particular to linear decompositions of the mean, like $x^\top\beta$.

This comes from the fact that the estimating equations for the coefficients of the conditional mean are practically the same for the poisson and quasi-poisson models. Quasi-poisson specifies the variance function in terms of the mean and an additional parameter (say $\theta$) as $Var(y)=\theta\mu$ (with $\theta=1$ for Poisson), but $\theta$ turns out to be irrelevant when solving the estimating equations. Thus $\theta$ plays no role in estimating $\beta$ whenever the conditional mean and variance are proportional. Therefore the point estimates $\hat{\beta}$ are identical for the quasi-poisson and poisson models!
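To see why $\theta$ drops out, write the quasi-score equations for $\beta$ with variance function $\theta\mu$:

$\sum_{i=1}^{n}\frac{y_i-\mu_i}{\theta\mu_i}\frac{\partial\mu_i}{\partial\beta}=0$

The constant $\theta$ multiplies every term, so it cancels from the equation, leaving exactly the Poisson score equations and hence the same solution $\hat{\beta}$.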

Let me illustrate with an example:

> library(MASS)
> data(quine) 
> modp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="poisson")
> modqp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="quasipoisson")
> summary(modp)

Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "poisson", 
    data = quine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.808  -3.065  -1.119   1.819   9.909  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.71538    0.06468  41.980  < 2e-16 ***
AgeF1       -0.33390    0.07009  -4.764 1.90e-06 ***
AgeF2        0.25783    0.06242   4.131 3.62e-05 ***
AgeF3        0.42769    0.06769   6.319 2.64e-10 ***
SexM         0.16160    0.04253   3.799 0.000145 ***
EthN        -0.53360    0.04188 -12.740  < 2e-16 ***
LrnSL        0.34894    0.05204   6.705 2.02e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: 2299.2

Number of Fisher Scoring iterations: 5

> summary(modqp)

Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "quasipoisson", 
    data = quine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.808  -3.065  -1.119   1.819   9.909  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.7154     0.2347  11.569  < 2e-16 ***
AgeF1        -0.3339     0.2543  -1.313 0.191413    
AgeF2         0.2578     0.2265   1.138 0.256938    
AgeF3         0.4277     0.2456   1.741 0.083831 .  
SexM          0.1616     0.1543   1.047 0.296914    
EthN         -0.5336     0.1520  -3.511 0.000602 ***
LrnSL         0.3489     0.1888   1.848 0.066760 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 13.16691)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

As you can see, even though we have strong overdispersion of 12.21 in this data set (computed as deviance(modp)/modp$df.residual), the regression coefficients (point estimates) do not change at all. But notice how much the standard errors change.
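Both claims are easy to verify in the same session; the quasipoisson standard errors are just the poisson ones inflated by the square root of the estimated (Pearson) dispersion, $\sqrt{13.16691}\approx 3.63$:

# The point estimates coincide exactly:
all.equal(coef(modp), coef(modqp))
# Each quasipoisson standard error is the poisson one times sqrt(13.16691):
summary(modqp)$coefficients[, "Std. Error"] / summary(modp)$coefficients[, "Std. Error"]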

The question of the effect of overdispersion in penalized poisson models

Penalized models are mostly used for prediction and variable selection, not (yet) for inference. So people who use these models are interested in the regression parameters for the conditional mean, just shrunk towards zero. If the penalization is the same, the estimating equations for the conditional mean derived from the penalized (quasi-)likelihood also do not depend on $\theta$, and therefore overdispersion does not matter for the estimates of $\beta$ in a model of the type:

$g(\mu)=x^\top\beta + f(\beta)$

since $\beta$ is estimated the same way for any variance function of the form $\theta \mu$, that is, for all models where the conditional mean and variance are proportional. This is just like in the unpenalized poisson/quasipoisson models.

If you don't want to take this at face value and would rather avoid the math, you can find empirical support in the fact that in glmnet, if you drive the regularization parameter towards 0 (so that $f(\beta)=0$), you end up pretty much where the poisson and quasipoisson models land (see the last column below, where lambda is 0.005):

> library(glmnet)
> y <- quine[,5]
> x <- model.matrix(~Age+Sex+Eth+Lrn,quine)
> modl <- glmnet(y=y,x=x, lambda=c(0.05,0.02,0.01,0.005), family="poisson")
> coefficients(modl)
8 x 4 sparse Matrix of class "dgCMatrix"
                    s0         s1         s2         s3
(Intercept)  2.7320435  2.7221245  2.7188884  2.7172098
(Intercept)  .          .          .          .        
AgeF1       -0.3325689 -0.3335226 -0.3339580 -0.3340520
AgeF2        0.2496120  0.2544253  0.2559408  0.2567880
AgeF3        0.4079635  0.4197509  0.4236024  0.4255759
SexM         0.1530040  0.1581563  0.1598595  0.1607162
EthN        -0.5275619 -0.5311830 -0.5323936 -0.5329969
LrnSL        0.3336885  0.3428815  0.3459650  0.3474745
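To make the comparison explicit, you can line up the unpenalized fit against the lambda = 0.005 column (a quick check; row 2 is the duplicated intercept that appears because model.matrix already contains an intercept column, so it is dropped):

# Side-by-side comparison of glm and the weakest-penalty glmnet fit:
round(cbind(glm = coef(modp), glmnet = as.matrix(coefficients(modl))[-2, "s3"]), 4)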

So what does OD do to penalized regression models? As you may know, there is still some debate about the proper way to calculate standard errors for penalized models, and glmnet does not output any, probably for that reason. It may very well be that OD would influence the inference part of the model, just as it does in the non-penalized case, but until some consensus regarding inference in penalized models is reached, we won't know.

As an aside, one can leave all this messiness behind by adopting a Bayesian view, in which penalized models are just standard models with a specific prior.
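For example (a sketch only, assuming the rstanarm package is available; the lasso penalty corresponds to a Laplace-type shrinkage prior on the coefficients, and a negative binomial likelihood models the overdispersion directly instead of ignoring it):

library(rstanarm)

# Bayesian analogue of a penalized count model on the quine data:
# lasso() places a Laplace-type shrinkage prior on the coefficients, and
# neg_binomial_2() absorbs the overdispersion that plain poisson ignores.
modb <- stan_glm(Days ~ Age + Sex + Eth + Lrn, data = quine,
                 family = neg_binomial_2(), prior = lasso())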