Can I say something about the propensity to overfit in (A) versus (B)?
Provided that both grids cover a sufficient range, grid fineness doesn't really have anything to do with overfitting in this problem (though a coarse grid might underfit if it skips over a profitable interval). It's not as if testing too many values will somehow change what out-of-sample performance looks like.* In the case of these penalized regressions, we optimize the penalized likelihood function over values of $\lambda$, and it doesn't matter how many values of $\lambda$ we test: for a fixed data set and a fixed partitioning, out-of-sample performance is entirely deterministic. More to the point, the out-of-sample metric is not at all altered by how many values of $\lambda$ you test. A coarser grid might mean that you skip over the absolute minimum of your out-of-sample metric, but finding the absolute minimum probably isn't desirable in the first place: hyperparameters tend to be poorly estimated, and in finite samples the data limitations introduce noise into that estimate which will overwhelm slight changes in the distance between adjacent grid points. In other words, the standard error of your estimate will tend to swamp differences in grid fineness.
If you're truly concerned that the out-of-sample performance metric might be overly optimistic, you could adopt the one-standard-error rule, which picks the most regularized model whose performance is within one standard error of the minimum. That way, you're being slightly more conservative and picking a less complex model.
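The one-standard-error rule can be sketched in a few lines. This is an illustrative Python sketch (not the glmnet implementation): all names and the toy numbers below are assumed, and the inputs are the cross-validated mean errors and their standard errors over a $\lambda$ grid.

```python
def one_se_rule(lambdas, cv_means, cv_ses):
    """Most regularized lambda whose CV error is within 1 SE of the minimum."""
    # Locate the grid point with the minimum cross-validated error.
    i_min = min(range(len(cv_means)), key=lambda i: cv_means[i])
    threshold = cv_means[i_min] + cv_ses[i_min]
    # Among all grid points under the threshold, take the largest penalty,
    # i.e., the simplest (most regularized) model.
    return max(lam for lam, m in zip(lambdas, cv_means) if m <= threshold)

# Toy example (hypothetical numbers): the CV minimum is at lambda = 0.1,
# but lambda = 1.0 is within one standard error of it, so it is chosen.
lambdas  = [0.01, 0.1, 1.0, 10.0]
cv_means = [0.50, 0.42, 0.44, 0.70]
cv_ses   = [0.05, 0.03, 0.03, 0.05]
print(one_se_rule(lambdas, cv_means, cv_ses))  # 1.0
```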
Can I determine the optimal grid fineness? How?
The LARS algorithm does not define a priori which values of $\lambda$ to check; rather, $\lambda$ is varied continuously and the algorithm detects the values of $\lambda$ at which a coefficient moves from zero to a nonzero value. Those values of $\lambda$ are retained; since the coefficient paths are piecewise linear in the case of the lasso, no information is lost by storing only the knots. LARS only works when coefficient paths are piecewise linear, though. The ridge penalty never shrinks a coefficient to precisely zero, so all of its coefficient paths are smooth and always nonzero; the same holds for elastic net regressions (excluding those elastic net regressions which are also lasso regressions).
But most people use glmnet because it's often faster. To determine what grid of $\lambda$ to search over, I recommend reading the glmnet article "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie, and Rob Tibshirani. In it, they develop a very efficient algorithm for estimating ridge, lasso, and elastic net regressions. The algorithm finds the value $\lambda_\text{max}$ for which $\beta$ is the zero vector, then fixes a minimum value $\lambda_\text{min}$ relative to $\lambda_\text{max}$, and finally generates a sequence of values between the two that is uniform on the log scale. This grid is sufficient for most purposes, though unlike LARS it does not tell you precisely at which $\lambda$ a coefficient first becomes nonzero. Warm starts are used to provide solutions much more quickly, and the algorithm supports many common GLMs.
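That grid construction can be sketched directly. The following Python sketch (glmnet itself is Fortran/R; the function name and defaults here are my own) uses the lasso fact that the smallest all-zero penalty is $\lambda_\text{max} = \max_j |x_j^\top y|/n$ for centered, standardized data, sets $\lambda_\text{min} = \varepsilon\,\lambda_\text{max}$, and spaces the grid uniformly on the log scale:

```python
import math

def lambda_grid(X, y, n_lambda=100, eps=1e-3):
    """Glmnet-style lambda grid (sketch, assuming centered/standardized data):
    lambda_max = max_j |x_j' y| / n is the smallest penalty at which every
    lasso coefficient is zero; lambda_min = eps * lambda_max; the returned
    sequence is decreasing and uniform on the log scale."""
    n, p = len(y), len(X[0])
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) / n
                  for j in range(p))
    log_max, log_min = math.log(lam_max), math.log(eps * lam_max)
    return [math.exp(log_max + k * (log_min - log_max) / (n_lambda - 1))
            for k in range(n_lambda)]

# Tiny made-up design, just to show the shape of the output.
grid = lambda_grid([[1.0, 0.5], [0.5, 1.0], [1.0, -0.5]],
                   [1.0, 2.0, 0.5], n_lambda=5, eps=0.01)
# grid[0] is lambda_max, grid[-1] is eps * lambda_max, strictly decreasing.
```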
*You might be thinking about this from the perspective of an artificial neural network, where early stopping is sometimes used to accomplish regularization, but that's an entirely unrelated problem (namely, that the optimization algorithm is prevented from reaching an optimum, so the model is forced to be less complex).
Glmnet is returning a very large optimal regularization parameter, i.e., it is regularizing away all of your coefficients. It looks like glmnet is telling you that, after accounting for your prior (or offset) coefficients, what is left is noise. That is, you already offset the correct coefficients, and the model is just validating that.
Best Answer
This is a long answer. So, let's give a short-story version of it here.

- `R` code, absent any attempts at optimization, can compute a grid of size 100 with $p = 100\,000$ in a few seconds. Carefully written `C` code would reduce this by at least 2–3 orders of magnitude.
- There are two schemes given below to guarantee monotonic convergence. One uses the bounds shown below, which seem to help save a Newton step or two on occasion.
- Example: $p = 100\,000$ and a uniform grid for the degrees of freedom of size 100. The eigenvalues are Pareto-distributed, hence highly skewed. Below are tables of the number of Newton steps to find each root.
There won't be a closed-form solution for this, in general, but there is a lot of structure present which can be used to produce very effective and safe solutions using standard root-finding methods.
Before digging too deeply into things, let's collect some properties and consequences of the function $$\newcommand{\df}{\mathrm{df}} \df(\lambda) = \sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda} \>. $$
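For concreteness, here is that function as a small Python sketch (the thread's actual code is in R; the function name and example singular values below are my own):

```python
def df(lam, d):
    """Effective degrees of freedom of ridge regression with penalty lam,
    given the singular values d of the design matrix:
    df(lam) = sum_i d_i^2 / (d_i^2 + lam)."""
    return sum(di**2 / (di**2 + lam) for di in d)

d = [3.0, 2.0, 1.0]          # hypothetical singular values, p = 3
print(df(0.0, d))            # 3.0: with no penalty, df equals p
# df is decreasing in lam: larger penalties give fewer effective parameters.
```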
Property 0: $\df$ is a rational function of $\lambda$. (This is apparent from the definition.)
Consequence 0: No general algebraic solution will exist for finding the root $\df(\lambda) - y = 0$. This is because there is an equivalent polynomial root-finding problem of degree $p$ and so if $p$ is not extremely small (i.e., less than five), no general solution will exist. So, we'll need a numerical method.
Property 1: The function $\df$ is convex and decreasing on $\lambda \geq 0$. (Take derivatives.)
Consequence 1(a): Newton's root-finding algorithm will behave very nicely in this situation. Let $y$ be the desired degrees of freedom and $\lambda_0$ the corresponding root, i.e., $y = \df(\lambda_0)$. In particular, if we start out with any initial value $\lambda_1 < \lambda_0$ (so, $\df(\lambda_1) > y$), then the sequence of Newton-step iterations $\lambda_1,\lambda_2,\ldots$ will converge monotonically to the unique solution $\lambda_0$.
Consequence 1(b): Furthermore, if we were to start out with $\lambda_1 > \lambda_0$, then the first step would yield $\lambda_2 \leq \lambda_0$, from whence it will monotonically increase to the solution by the previous consequence (see caveat below). Intuitively, this last fact follows because if we start to the right of the root, the derivative is "too" shallow due to the convexity of $\df$ and so the first Newton step will take us somewhere to the left of the root. NB Since $\df$ is not in general convex for negative $\lambda$, this provides a strong reason to prefer starting to the left of the desired root. Otherwise, we need to double check that the Newton step hasn't resulted in a negative value for the estimated root, which may place us somewhere in a nonconvex portion of $\df$.
Consequence 1(c): Once we've found the root for some $y_1$ and are then searching for the root from some $y_2 < y_1$, using $\lambda_1$ such that $\df(\lambda_1) = y_1$ as our initial guess guarantees we start to the left of the second root. So, our convergence is guaranteed to be monotonic from there.
Property 2: Reasonable bounds exist to give "safe" starting points. Using convexity arguments and Jensen's inequality, we have the following bounds $$ \frac{p}{1+ \frac{\lambda}{p}\sum d_i^{-2}} \leq \df(\lambda) \leq \frac{p \sum_i d_i^2}{\sum_i d_i^2 + p \lambda} \>. $$ Consequence 2: This tells us that the root $\lambda_0$ satisfying $\df(\lambda_0) = y$ obeys $$ \frac{1}{\frac{1}{p}\sum_i d_i^{-2}}\left(\frac{p - y}{y}\right) \leq \lambda_0 \leq \left(\frac{1}{p}\sum_i d_i^2\right) \left(\frac{p - y}{y}\right) \>. \tag{$\star$} $$ So, up to a common constant, we've sandwiched the root in between the harmonic and arithmetic means of the $d_i^2$.
This assumes that $d_i > 0$ for all $i$. If this is not the case, then the same bound holds by considering only the positive $d_i$ and replacing $p$ by the number of positive $d_i$. NB: Since $\df(0) = p$ assuming all $d_i > 0$, then $y \in (0,p]$, whence the bounds are always nontrivial (e.g., the lower bound is always nonnegative).
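The bounds in $(\star)$ translate directly into code. A minimal Python sketch (names mine), which also handles the caveat above by discarding any nonpositive $d_i$:

```python
def df_bounds(y, d):
    """Bracket the root lambda_0 of df(lambda) = y using (star):
    the harmonic and arithmetic means of the positive d_i^2,
    each scaled by (p - y) / y."""
    d2 = [di**2 for di in d if di > 0]     # drop nonpositive d_i (see note)
    p = len(d2)
    harm = p / sum(1.0 / x for x in d2)    # harmonic mean of d_i^2
    arit = sum(d2) / p                     # arithmetic mean of d_i^2
    return harm * (p - y) / y, arit * (p - y) / y

# With hypothetical singular values [3, 2, 1] and target df y = 2, the root
# is sandwiched between harmonic- and arithmetic-mean based bounds.
lo, hi = df_bounds(2.0, [3.0, 2.0, 1.0])
```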
Here is a plot of a "typical" example of $\df(\lambda)$ with $p = 400$. We've superimposed a grid of size 10 for the degrees of freedom. These are the horizontal lines in the plot. The vertical green lines correspond to the lower bound in $(\star)$.
An algorithm and some example R code
A very efficient algorithm given a grid of desired degrees of freedom $y_1, \ldots y_n$ in $(0,p]$ is to sort them in decreasing order and then sequentially find the root of each, using the previous root as the starting point for the following one. We can refine this further by checking if each root is greater than the lower bound for the next root, and, if not, we can start the next iteration at the lower bound instead.
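A minimal Python sketch of that scheme (the answer's actual implementation is in R; the tolerance and names here are my own). Each Newton search starts at the larger of the previous root and the lower bound from $(\star)$, so every iteration approaches its root monotonically from the left:

```python
def solve_df_grid(dfs, d):
    """For each target degrees of freedom (processed in decreasing order),
    solve df(lambda) = target by Newton's method, warm-starting at the
    previous root or the (star) lower bound, whichever is larger."""
    d2 = [di**2 for di in d]
    p = len(d2)
    harm = p / sum(1.0 / x for x in d2)           # harmonic mean of d_i^2

    def f(lam):                                    # df(lambda)
        return sum(x / (x + lam) for x in d2)

    def fprime(lam):                               # derivative, always negative
        return -sum(x / (x + lam)**2 for x in d2)

    roots, lam = [], 0.0
    for target in sorted(dfs, reverse=True):
        lam = max(lam, harm * (p - target) / target)   # safe start, left of root
        for _ in range(100):                            # Newton: monotone from left
            step = (f(lam) - target) / fprime(lam)
            lam -= step
            if abs(step) < 1e-12 * (1.0 + lam):
                break
        roots.append(lam)
    return roots                                    # increasing in lambda
```

Because the targets are visited in decreasing order and $\df$ is decreasing, each returned root is larger than the last, and every start point sits to the left of its root, so Consequence 1(a) guarantees monotone convergence throughout.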
Here is some example code in `R`, with no attempts made to optimize it. As seen below, it is still quite fast even though `R` is, to put it politely, horrifyingly, awfully, terribly slow at loops.

Below is the final full algorithm, which takes a grid of points and a vector of the $d_i$ (not $d_i^2$!).
Sample function call