Solved – Intercept update in logistic regression lasso using coordinate descent: how is it calculated

glmnet, lasso, logistic

I am trying to figure out how the intercept is calculated for logistic regression lasso using coordinate descent algorithm based on this seminal paper:
Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1–22 (2010).

In the case of linear regression, the authors clearly state that the intercept is $\hat{\beta}_0=\bar{y}$, which is easy to see. In the logistic regression case, they define a "working response" $z_i$ that simply takes the place of $y_i$ in the coordinate updates from the linear regression case. However, unlike in linear regression, it seems to me that $\beta_0$ also needs to be updated in each loop, since this working response directly involves both the intercept and all of the coefficients. The authors don't seem to explain clearly how to do this. So my question is: how is the intercept updated for logistic regression? Do I simply replace $y_i$ with $z_i$ and use $\hat{\beta}_0=\bar{z}$? If so, what is the mathematical justification? I'm not familiar with Fortran, and their core code for logistic regression is written without any comments/explanation and is just too much for me to wade through.
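For reference (writing it in my own notation, so apologies if it differs slightly from the paper), the quadratic approximation they use has working response and weights

$$z_i = \tilde{\beta}_0 + x_i^T\tilde{\beta} + \frac{y_i - \tilde{p}(x_i)}{\tilde{p}(x_i)\,\bigl(1-\tilde{p}(x_i)\bigr)}, \qquad w_i = \tilde{p}(x_i)\,\bigl(1-\tilde{p}(x_i)\bigr),$$

where $\tilde{p}(x_i)$ is the fitted probability at the current estimates $(\tilde{\beta}_0, \tilde{\beta})$, which is why I say the working response involves the intercept and all of the coefficients.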

Thanks.

Best Answer

I think your suspicion is correct - the intercept term does need to be updated for all of the glmnet families except the Gaussian (linear regression) case, where $\hat{\beta}_0=\bar{y}$ does the job.
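One way to see what the update has to be (my own derivation, not something quoted from the paper or the code): in the penalized weighted least squares problem solved at each IRLS step, the intercept carries no penalty, so setting the derivative with respect to $\beta_0$ to zero gives

$$\hat{\beta}_0 = \frac{\sum_i w_i\,\bigl(z_i - x_i^T\beta\bigr)}{\sum_i w_i},$$

the weighted mean of the partial residuals. This collapses to $\bar{z}$ only when the weights are all equal and the weighted average of $x_i^T\beta$ is zero, and it has to be recomputed on every pass because $w_i$, $z_i$ and $\beta$ all change.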

Here's a link to the glmnet Fortran code. The function that fits logistic nets begins on line 2039. If you begin there and search downward for a0, which is the vector containing the intercept values for each $\lambda$:

# From the source comments
a0(lmu) = intercept values for each solution

you will get to:

a0(ic,k)=a0(ic,k)-dot_product(ca(1:nk,ic,k),xm(ia(1:nk)))

Here k is the $\lambda$ index and ca is the array of coefficient values for each solution:

# From the source comments
ca(nx,lmu) = compressed coefficient values for each solution

And I believe xm is the vector of sample-weighted column means of the design matrix (one entry per column of x):

xm(j)=dot_product(w,x(:,j)) 

This looks a lot like an update rule for the intercept term, which should take the same form as the updates for the other coefficients, minus the soft-thresholding, since the intercept is not penalized.
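If it helps to see the whole scheme in something more readable than Fortran, here is a minimal NumPy sketch of the algorithm as I understand it from the paper: an outer IRLS loop that rebuilds the weights and working response, and an inner coordinate descent pass over the penalized weighted least squares problem that refreshes the unpenalized intercept along with the coefficients. This is my own illustration (no standardization, active sets, or warm starts), not glmnet's actual implementation, and the function names are made up; lam plays the role of the paper's $\lambda$ with $\alpha=1$ (pure lasso).

import numpy as np

def soft_threshold(u, lam):
    # S(u, lam) = sign(u) * max(|u| - lam, 0)
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def logistic_lasso_cd(X, y, lam, n_outer=25, n_inner=100, tol=1e-8):
    # X: (n, p) design matrix, y: 0/1 responses, lam: lasso penalty
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_outer):                       # outer IRLS loop
        eta = beta0 + X @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        w = np.maximum(prob * (1.0 - prob), 1e-5)  # weights w_i = p_i (1 - p_i), floored
        z = eta + (y - prob) / w                   # working response
        for _ in range(n_inner):                   # coordinate descent on the weighted LS problem
            beta0_old, beta_old = beta0, beta.copy()
            # unpenalized intercept: weighted mean of the partial residuals
            beta0 = np.sum(w * (z - X @ beta)) / np.sum(w)
            for j in range(p):
                # partial residual excluding coordinate j
                r_j = z - beta0 - X @ beta + X[:, j] * beta[j]
                num = np.sum(w * X[:, j] * r_j) / n
                denom = np.sum(w * X[:, j] ** 2) / n
                beta[j] = soft_threshold(num, lam) / denom
            if abs(beta0 - beta0_old) < tol and np.max(np.abs(beta - beta_old)) < tol:
                break
    return beta0, beta

The line that matters for your question is beta0 = np.sum(w * (z - X @ beta)) / np.sum(w): because the intercept is not penalized there is no soft-thresholding, and it only reduces to $\bar{z}$ in the unweighted case with centered predictors.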

I feel your pain in looking at the FORTRAN code. It scares the crap out of me.