I think your suspicion is correct - the intercept term does need to be updated for all glmnets except for elastic net.
Here's a link to the glmnet Fortran code. The function that fits logistic nets begins on line 2039. If you begin there and search downwards for a0, which is the vector containing the intercept values for each $\lambda$:
# From the source comments
a0(lmu) = intercept values for each solution
you will get to:
a0(ic,k)=a0(ic,k)-dot_product(ca(1:nk,ic,k),xm(ia(1:nk)))
k is the lambda index, and ca is the vector of all coefficients:
# From the source comments
ca(nx,lmu) = compressed coefficient values for each solution
And I believe xm is the vector of sample-weighted column means of the design matrix:
xm(j)=dot_product(w,x(:,j))
This looks a lot like an update rule for the intercept term, which should be of the same form as the other coefficients.
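To make the pattern concrete, here is a minimal numpy sketch of the same idea (my own illustration, not glmnet's actual algorithm): fit the coefficients on weight-centered predictors, then shift the intercept back by the dot product of the coefficients with the weighted column means, which is exactly the shape of the a0 update above.

```python
import numpy as np

# A minimal sketch of the pattern (an illustration, not glmnet's code):
# fit on weight-centered predictors, then recover the original-scale
# intercept by subtracting dot(coefficients, weighted column means),
# the same shape as the a0(ic,k) update above.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=n)

w = np.full(n, 1.0 / n)        # uniform sample weights
xm = w @ X                     # weighted column means, like xm(j) above
Xc = X - xm                    # centered design matrix

beta, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
a0 = y.mean() - beta @ xm      # intercept shifted back to the original scale
```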
I feel your pain in looking at the FORTRAN code. It scares the crap out of me.
They are talking about the same thing. They simply use different notations, and one is a particular case of the other.
I'll start with The Elements of Statistical Learning, which presents the general case. We have:
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
Here $\hat{\beta}$ is a vector of the form $(\hat{\beta_1},\hat{\beta_2},\dots,\hat{\beta_p})$: the vector of fitted coefficients for a linear regression with $p$ variables, including the intercept. We also have $X$, the design matrix with each $x_i$ as a column, and $y$, the vector of the dependent (response) variable. These equations are well known and are sometimes called the normal equations.
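As a quick sanity check (my own addition, using numpy), the normal equations agree with a standard least squares solver:

```python
import numpy as np

# Verify that (X^T X)^{-1} X^T y matches a least squares solver.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column
y = rng.normal(size=n)

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_normal, beta_lstsq)
```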
Let's move to the ISLR book. The exposition there discusses a particular case of the multiple linear regression above. Specifically, it describes a linear regression with a single predictor and an intercept. That means in our case the design matrix $X$ has two columns: the intercept (all ones) and the single predictor $x$. So, $X = \begin{bmatrix}1 &x\end{bmatrix}$. Also, $\hat{\beta}$ is the vector of the two fitted model parameters, so $\hat{\beta}=\begin{bmatrix}\hat{\beta_0} & \hat{\beta_1}\end{bmatrix}^T$, or in your notation $\begin{bmatrix}\hat{B_0} & \hat{B_1}\end{bmatrix}^T$. I will use beta instead of B, since I am more comfortable with it.
A preliminary calculation shows that:
$$\begin{bmatrix}1 &x\end{bmatrix}^T \begin{bmatrix}1 &x\end{bmatrix}
= \begin{bmatrix}n & \sum x \\ \sum x & x^Tx\end{bmatrix} = n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}$$
Here we used the fact that:
$$\begin{bmatrix}1 &x\end{bmatrix}^T \begin{bmatrix}1 &x\end{bmatrix}=\begin{bmatrix}1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n\end{bmatrix}\begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\1 & x_n\end{bmatrix} =
\begin{bmatrix}n & \sum x\\\sum x & x^Tx\end{bmatrix}=
n\begin{bmatrix}1 & \bar{x} \\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}$$
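If you want to convince yourself numerically, here is a quick check I added (not part of the book):

```python
import numpy as np

# Check the block computation of [1 x]^T [1 x] above.
rng = np.random.default_rng(3)
n = 10
x = rng.normal(size=n)
D = np.column_stack([np.ones(n), x])  # the design matrix [1 x]
lhs = D.T @ D
rhs = n * np.array([[1.0, x.mean()], [x.mean(), x @ x / n]])
assert np.allclose(lhs, rhs)
```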
Considering all that, we now have the normal equations for your particular case:
$$\begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix}
= \left(n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix}\right)^{-1}
\begin{bmatrix}1 & x\end{bmatrix}^T y$$
Notice that it is not easy to invert that matrix symbolically, so we will instead multiply both sides on the left by it to get rid of the inverse. Thus we obtain:
$$n\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix} \begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix}
= \begin{bmatrix}1 & x\end{bmatrix}^T y$$
Moving $n$ to the right-hand side and using $\frac{1}{n}\sum_i y_i = \bar{y}$, we have
$$\begin{bmatrix}1 & \bar{x}\\ \bar{x} & \frac{x^Tx}{n}\end{bmatrix} \begin{bmatrix}\hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix}
= \begin{bmatrix}\bar{y} \\ \frac{x^Ty}{n}\end{bmatrix}$$
What we have obtained is a system of two equations, and both can be used. The first equation is the one you already have:
$$\hat{\beta_0} + \bar{x}\hat{\beta_1} = \bar{y}, \quad\text{i.e.,}\quad \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$
I am convinced that the second equation, after some substitutions, looks the way you saw it in the book; the substitution is spelled out below.
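To make that explicit (my own step, not part of the original answer), the second row of the system reads
$$\bar{x}\hat{\beta_0} + \frac{x^Tx}{n}\hat{\beta_1} = \frac{x^Ty}{n}$$
and substituting $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$ from the first equation gives
$$\hat{\beta_1} = \frac{\frac{x^Ty}{n} - \bar{x}\bar{y}}{\frac{x^Tx}{n} - \bar{x}^2} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
which is exactly the slope formula from ISLR.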
In conclusion, ISLR discusses a particular case, where all the beta coefficients are scalars, while the other description covers the generic case, where $\beta$ is a vector of coefficients. Hope that helped.
Best Answer
For the first question, recall that in centering we replace each value $y_i$ with $y_i - \bar y$, where $\bar y$ is the mean of the $y$ vector. Then
$$ \sum_i (y_i - \bar y) = \sum_i y_i - n \bar y = \sum_i y_i - \sum_i y_i = 0 $$
Well, no. First, $\bar y = 0$ is only true if $y$ is the centered response. What is true in general is
$$ \overline{y - \bar y} = 0 $$
that is, the mean of the centered response is zero. This relies on two facts: the mean $\bar y$ is a constant that does not depend on the summation index $i$, and summing a constant over $i = 1, \dots, n$ multiplies it by $n$.
These two together mean that $\sum_i \bar y = n \bar y$, as we have only added a constant to itself $n$ times. Using the definition of the mean, $n \bar y = \sum_i y_i$. All this together gives the cancellation discussed above.
Now, if $y$ is centered, then the intercept in LASSO is always zero. To see why, recall that LASSO is attempting to minimize
$$ L(\beta) = \sum_i (y_i - \sum_j \beta_j x_{ij})^2 + \lambda \sum_{j > 0} \left| \beta_j \right| $$
The intercept parameter does not appear in the penalty term, which is very important here. Since the penalty term is the only place absolute values appear, the loss function is differentiable with respect to the intercept parameter. Therefore this partial derivative must be zero at any extremum (maximum or minimum) of the loss function. So let's compute this partial derivative:
$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_i(y_i - \sum_j \beta_j x_{ij}) x_{i0} $$
Now $x_{i0} = 1$ for all $i$, because this is the intercept column of $X$. Furthermore, $\sum_i x_{ij} = 0$ for all other $j$, because we always center all of the other (non-intercept) predictors. With these two relations we can simplify the above to
$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_i y_i + 2 \sum_{j > 0} \beta_j \sum_i x_{ij} + 2 n \beta_0 = 0 + 0 + 2 n \beta_0 $$
Setting this to zero gives
$$ \beta_0 = 0 $$
Notice that this happens regardless of $\lambda$.
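A quick empirical illustration (my addition, assuming scikit-learn's Lasso, which solves the same objective up to scaling of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso

# With y and the columns of X centered, the fitted intercept is zero
# no matter which penalty strength we pick.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 4.0 + rng.normal(size=200)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

for alpha in (0.01, 0.1, 1.0):   # alpha plays the role of lambda
    model = Lasso(alpha=alpha).fit(Xc, yc)
    assert abs(model.intercept_) < 1e-10
```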
Well, I think I broke down the general case as far as I am capable of, so maybe simplifying the problem a bit will illuminate what's going on. Consider the one-variable case of the LASSO: we have one predictor $x$ and two coefficients to estimate, the intercept $\beta_0$ and the coefficient of $x$, $\beta_1$.
The loss function in this simplified case is easier to write out in explicit detail
$$ L = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda | \beta_1 | $$
The important feature here is that $\beta_0$ only appears in the sum of squares term, not in the penalty. So $L$ is a differentiable function with respect to the variable $\beta_0$, but is not differentiable with respect to the variable $\beta_1$. So, at a minimum, we can be guaranteed that the partial derivative with respect to $\beta_0$ is zero, but we can make no such claim about $\beta_1$.
Actually computing this partial derivative is pretty simple using the standard rules of differential calculus: the derivative of the penalty term is zero because $\beta_0$ does not appear in it, and then one application of the chain rule handles the sum of squares term
$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_i (y_i - \beta_0 - \beta_1 x_i) $$
Using the fact that the response and predictors are centered, this expression reduces to
$$ 2 n \beta_0 $$
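If you'd rather not trust my algebra, here is a symbolic check (my addition, assuming sympy) that the partial derivative really collapses to $2n\beta_0$ once the centering constraints are imposed:

```python
import sympy as sp

# Symbolic check that d/d(beta_0) of the sum of squares reduces to
# 2*n*beta_0 when both x and y are centered (sum to zero).
n = 4
b0, b1 = sp.symbols("beta_0 beta_1")
xs = sp.symbols("x1:5")
ys = sp.symbols("y1:5")
L = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(xs, ys))
dL = sp.diff(L, b0)
# Impose centering by eliminating the last x and y.
dL = dL.subs({xs[-1]: -sum(xs[:-1]), ys[-1]: -sum(ys[:-1])})
assert sp.simplify(dL - 2 * n * b0) == 0
```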
It is in there. There are two terms in the loss function being minimized: the residual sum of squares and the penalty. It is true that $\beta_0$ does not appear in the penalty term, but it does appear in the residual sum of squares. Notice that the sum in the residual sum of squares is over all $j$, while in the penalty term it is only over those $j > 0$.
For the second question, the point the books are trying to make is that there is no closed-form solution for the LASSO coefficients. That is, there is no explicit, purely algebraic procedure for determining them. By contrast, there is an algebraic procedure for solving linear regression: you just solve the equation
$$ X^T X \beta = X^T y $$
Even if there is no algebraic solution, that does not mean that there is no algorithmic solution. This is why iterative methods are used to solve the LASSO problem.
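For a feel of what such an iterative method looks like, here is a minimal cyclic coordinate descent sketch (my own illustration, not the glmnet implementation) for the objective $\frac{1}{2n}\|y - X\beta\|^2 + \lambda \|\beta\|_1$ with centered $X$ and $y$:

```python
import numpy as np

def soft_threshold(z, gamma):
    # The scalar LASSO solution: shrink z toward zero by gamma.
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1,
    # assuming the columns of X and y are already centered.
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]   # residual excluding feature j
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam) / col_sq[j]
    return b
```

Each per-coordinate update has a closed form (the soft threshold), but the joint minimizer does not, which is exactly the point.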
Be careful, the estimated coefficients are equal to the minimizer of the expression, not the expression itself.
As for the question, it depends on what you mean by "mathematically". You cannot find them exactly by mathematical manipulation, moving symbols around and taking square roots and such. You can find them (to any desired accuracy) using iterative optimization methods, which is what is generally done in software packages. Also, somewhat miraculously, you can find them exactly using some very clever geometrical arguments, which is explained in the LARS paper. Unfortunately, this is only possible in very nice cases, and fails for more general models, like logistic or Poisson regressions.