I found that the intercept in GLMnet is computed after the coefficient updates have converged. It is computed from the mean of the $y_i$'s and the means of the $x_{ij}$'s. The formula is similar to the previous one I gave, but with the $\beta_j$'s taken after the update loop: $\beta_0=\bar{y}-\sum_{j=1}^{p} \hat{\beta_j} \bar{x_j}$.
In Python this gives something like:
self.intercept_ = ymean - np.dot(Xmean, self.coef_.T)  # ymean, Xmean are the training means of y and X
which I found here on the scikit-learn page.
EDIT: the coefficients have to be rescaled (divided by the feature standard deviations) first:
self.coef_ = self.coef_ / X_std  # put the coefficients back on the original feature scale
$\beta_0=\bar{y}-\sum_{j=1}^{p} \frac{\hat{\beta_j} \bar{x_j}}{\sum_{i=1}^{n} x_{ij}^2}$.
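As a quick numerical sanity check of the first identity (this is my own sketch, not from the original post): fitting glmnet in R with standardize = FALSE, so the $\hat{\beta_j}$ are already on the original scale, the intercept should match $\bar{y}-\sum_j \hat{\beta_j} \bar{x_j}$ exactly. The data below are made up for illustration.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)                  # made-up design matrix
y <- drop(x %*% c(1, -2, 0, 0.5, 0)) + rnorm(100)    # made-up response
fit <- glmnet(x, y, standardize = FALSE)             # gaussian family, default lambda path
b <- coef(fit)[, 10]                                 # coefficients at one lambda on the fitted path
all.equal(unname(b[1]), mean(y) - sum(colMeans(x) * b[-1]))  # should be TRUE (up to numerical tolerance)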
Short answer
Overdispersion does not matter when estimating a vector of regression coefficients for the conditional mean in a quasi/poisson model! You will be fine if you forget about the overdispersion here, use glmnet with the poisson family and just focus on whether your cross-validated prediction error is low.
The Qualification follows below.
Poisson, Quasi-Poisson and estimating functions:
I say the above because overdispersion (OD) in a poisson or quasi-poisson model affects everything to do with the dispersion (or variance, or scale, or heterogeneity, or spread, or whatever you want to call it) and as such has an effect on the standard errors and confidence intervals, but it leaves the estimates for the conditional mean of $y$ (called $\mu$) untouched. This applies in particular to linear decompositions of the mean, like $x^\top\beta$.
This comes from the fact that the estimating equations for the coefficients of the conditional mean are practically the same for the poisson and quasi-poisson models. Quasi-poisson specifies the variance function in terms of the mean and an additional parameter (say $\theta$) as $Var(y)=\theta\mu$ (with $\theta=1$ for Poisson), but $\theta$ turns out to be irrelevant when optimizing the estimating equation. Thus $\theta$ plays no role in estimating $\beta$ when the conditional mean and variance are proportional. Therefore the point estimates $\hat{\beta}$ are identical for the quasi-poisson and poisson models!
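To make that step explicit (this is standard GLM/quasi-likelihood theory, added here for completeness): with variance function $V(\mu)=\theta\mu$, the estimating equation for $\beta$ is
$$\sum_{i=1}^{n} \frac{y_i-\mu_i}{\theta\,\mu_i}\,\frac{\partial \mu_i}{\partial \beta} \;=\; \frac{1}{\theta}\sum_{i=1}^{n} \frac{y_i-\mu_i}{\mu_i}\,\frac{\partial \mu_i}{\partial \beta} \;=\; 0,$$
so the constant $\frac{1}{\theta}$ factors out and the solution $\hat{\beta}$ does not depend on it.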
Let me illustrate with an example:
> library(MASS)
> data(quine)
> modp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="poisson")
> modqp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="quasipoisson")
> summary(modp)
Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "poisson",
    data = quine)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-6.808  -3.065  -1.119   1.819   9.909

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.71538    0.06468  41.980  < 2e-16 ***
AgeF1       -0.33390    0.07009  -4.764 1.90e-06 ***
AgeF2        0.25783    0.06242   4.131 3.62e-05 ***
AgeF3        0.42769    0.06769   6.319 2.64e-10 ***
SexM         0.16160    0.04253   3.799 0.000145 ***
EthN        -0.53360    0.04188 -12.740  < 2e-16 ***
LrnSL        0.34894    0.05204   6.705 2.02e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: 2299.2

Number of Fisher Scoring iterations: 5
> summary(modqp)
Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "quasipoisson",
    data = quine)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-6.808  -3.065  -1.119   1.819   9.909

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.7154     0.2347  11.569  < 2e-16 ***
AgeF1        -0.3339     0.2543  -1.313 0.191413
AgeF2         0.2578     0.2265   1.138 0.256938
AgeF3         0.4277     0.2456   1.741 0.083831 .
SexM          0.1616     0.1543   1.047 0.296914
EthN         -0.5336     0.1520  -3.511 0.000602 ***
LrnSL         0.3489     0.1888   1.848 0.066760 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 13.16691)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
As you can see, even though we have strong overdispersion of 12.21 in this data set (computed as deviance(modp)/modp$df.residual), the regression coefficients (point estimates) do not change at all. But notice how the standard errors change.
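A small check one could run on the fits above (not part of the original output) to confirm both claims:
deviance(modp) / modp$df.residual    # deviance-based dispersion estimate, about 12.2
all.equal(coef(modp), coef(modqp))   # TRUE: the point estimates are identical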
The question of the effect of overdispersion in penalized poisson models
Penalized models are mostly used for prediction and variable selection and not (yet) for inference. So people who use these models are interested in the regression parameters for the conditional mean, just shrunk towards zero. If the penalization is the same, the estimating equations for the conditional mean derived from the penalized (quasi-)likelihood also do not depend on $\theta$, and therefore overdispersion does not matter for the estimates of $\beta$ in a model of the type:
$g(\mu)=x^\top\beta + f(\beta)$
as $\beta$ is estimated the same way for any variance function of the form $\theta \mu$, so again for all models where conditional mean and variance are proportional. This is just like in unpenalized poisson/quasipoisson models.
If you don't want to take this at face value and would rather avoid the math, you can find empirical support in the fact that in glmnet, if you push the regularization parameter towards 0 (and thus $f(\beta)=0$), you end up pretty much where the poisson and quasipoisson models land (see the last column below, where lambda is 0.005).
> library(glmnet)
> y <- quine[,5]
> x <- model.matrix(~Age+Sex+Eth+Lrn,quine)
> modl <- glmnet(y=y,x=x, lambda=c(0.05,0.02,0.01,0.005), family="poisson")
> coefficients(modl)
8 x 4 sparse Matrix of class "dgCMatrix"
                    s0         s1         s2         s3
(Intercept)  2.7320435  2.7221245  2.7188884  2.7172098
(Intercept)  .          .          .          .
AgeF1       -0.3325689 -0.3335226 -0.3339580 -0.3340520
AgeF2        0.2496120  0.2544253  0.2559408  0.2567880
AgeF3        0.4079635  0.4197509  0.4236024  0.4255759
SexM         0.1530040  0.1581563  0.1598595  0.1607162
EthN         -0.5275619 -0.5311830 -0.5323936 -0.5329969
LrnSL        0.3336885  0.3428815  0.3459650  0.3474745
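To make the comparison explicit, here is a small side-by-side one could add (my own sketch, assuming the modl and modp objects above; the duplicated all-zero intercept row that glmnet creates for the model.matrix intercept column is dropped):
round(cbind(glmnet_s3 = coef(modl)[-2, 4], poisson_glm = coef(modp)), 4)
# the lambda = 0.005 column is essentially the unpenalized poisson fit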
So what does OD do to penalized regression models? As you may know, there is still some debate about the proper way to calculate standard errors for penalized models (see e.g., here), and glmnet does not output any, probably for that reason. It may very well be that OD would influence the inference part of the model, just as it does in the non-penalized case, but unless some consensus regarding inference in this case is reached, we won't know.
As an aside, one can leave all this messiness behind if one is willing to adopt a Bayesian view where penalized models are just standard models with a specific prior.
Best Answer
Here's an unintuitive fact - you're not actually supposed to give glmnet a single value of lambda. From the documentation here:
cv.glmnet will help you choose lambda, as you alluded to in your examples. The authors of the glmnet package suggest cv$lambda.1se instead of cv$lambda.min, but in practice I've had success with the latter.

After running cv.glmnet, you don't have to rerun glmnet! Every lambda in the grid (cv$lambda) has already been run. This technique is called "Warm Start" and you can read more about it here. Paraphrasing from the introduction, the Warm Start technique reduces the running time of iterative methods by using the solution of a different optimization problem (e.g., glmnet with a larger lambda) as the starting value for a later optimization problem (e.g., glmnet with a smaller lambda).

To extract the desired run from cv.glmnet.fit, try this:
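(The original extraction code is not shown here; as an illustrative reconstruction of the general idea only, assuming cv is the object returned by cv.glmnet, something along these lines pulls the lambda.min fit out of the already-computed path:)
idx <- which(cv$lambda == cv$lambda.min)   # position of lambda.min on the fitted grid
cv$glmnet.fit$beta[, idx]                  # coefficients (without intercept) at that lambda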
Revision (1/28/2017)

No need to hack into the glmnet object like I did above; take @alex23lemm's advice below and pass s = "lambda.min", s = "lambda.1se", or some other number (e.g., s = .007) to both coef and predict. Note that your coefficients and predictions depend on this value, which is set by cross validation. Use a seed for reproducibility! And don't forget that if you don't supply an "s" in coef and predict, you'll be using the default of s = "lambda.1se". I have warmed up to that default after seeing it work better in a small-data situation. s = "lambda.1se" also tends to provide more regularization, so if you're working with alpha > 0, it will also tend towards a more parsimonious model. You can also choose a numerical value of s with the help of plot.glmnet to get to somewhere in between (just don't forget to exponentiate the values from the x axis!).
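A minimal sketch of that revised workflow (the data here are made up for illustration; only the coef/predict calls with s are the point):
library(glmnet)
set.seed(42)                                   # seed so the CV folds are reproducible
x <- matrix(rnorm(200 * 10), 200, 10)          # made-up predictors
y <- drop(x %*% c(2, -1, rep(0, 8))) + rnorm(200)  # made-up response
cv <- cv.glmnet(x, y, alpha = 1)               # fits the whole lambda path and cross-validates it

coef(cv, s = "lambda.min")                     # coefficients at the CV-minimizing lambda
coef(cv, s = "lambda.1se")                     # the more regularized default
predict(cv, newx = x[1:5, ], s = 0.007)        # or pass a numerical value of s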