Solved – Grid fineness and overfitting when tuning $\lambda$ in LASSO, ridge, elastic net

elastic netlassooverfittingregularizationridge regression

I wonder about

  • the optimal grid fineness and
  • what the relation between grid fineness and overfitting is

in regularization methods such as LASSO, ridge regression or elastic net.

Suppose I want to fit a regression model using LASSO to a sample of 500 observations (I do not have the data; this is just an example). Suppose also that I have
(A) a grid with 100 different $\lambda$ values in the range between $\lambda_{min}$ and $\lambda_{max}$
(B) a grid with 1000 different $\lambda$ values in the same range,
where $\lambda$ is the parameter controlling the degree of penalization.

Questions:

  1. Can I say something about the propensity to overfit in (A) versus (B)?
  2. Can I determine the optimal grid fineness? How?

Best Answer

Can I say something about the propensity to overfit in (A) versus (B)?

Provided that both grids cover a sufficient range, grid fineness doesn't really have anything to do with overfitting in this problem (though a coarse grid might underfit if it skips over a profitable interval). It's not as if testing too many values will somehow change what out-of-sample looks like.* In the case of these penalized regressions, we definitely want to optimize our penalized likelihood function for values $\lambda$, and it doesn't matter how many values of $\lambda$ we test, because out-of-sample performance for a fixed data set and fixed partitioning is entirely deterministic. More to the point, the out-of-sample metric is not at all altered by how many values $\lambda$ you test. A coarser grid might mean that you skip over the absolute minimum in your out-of-sample metric, but finding the absolute minimum probably isn't desirable in the first place because hyperparameters tend to be poorly estimated, and finite sample properties mean that data limitations will be a source noise in that estimate that will overwhelm slight changes in the distance between adjacent grid points: the standard error of your estimate will tend to swamp differences in grid fineness.

If you're truely concerned that out-of-sample performance metric might be overly optimistic, you could adopt the 1 standard error rule, which picks the most regularized model within 1 standard error of the minimum. That way, you're being slightly more conservative and picking a less complex model.

Can I determine the optimal grid fineness? How?

The LARS algorithm does not a priori define which values of $\lambda$ to check; rather, $\lambda$ is changed continuously and the algorithm checks for values of $\lambda$ for which a coefficient goes from 0 to a nonzero value. Those values of $\lambda$ where a new coefficient is nonzero are retained, with the observation that coefficient paths are piecewise linear in the case of the lasso, so there's no loss of information by just storing off the knots in that case. LARS only works when coefficient paths are piecewise linear, though. The ridge penalty never shrinks a coefficient to precisely zero, so all of your coefficient paths are smooth and always nonzero; likewise elastic net regressions (excluding the case of elastic net regressions which are also lasso regressions).

But most people use GLMNET because it's often faster. In terms of determining what grid of $\lambda$ to search over, I recommend reading the GLMNET article "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie, and Rob Tibshirani. In it, they develop a very efficient algorithm for estimating ridge, lasso and elastic net regressions. The algorithm checks for a value of $\lambda_\text{max}$ for which $\beta$ is the zero vector, and then identifies a minimum value $\lambda_\text{min}$ relative to $\lambda_\text{max}$. Finally, they generate a sequence of values between the two uniformly on the log scale. This grid is sufficient for most purposes, though it does omit the property that you will know precisely when a coefficient is estimated at a nonzero value. Warm starts are used to provide solutions much more quickly, and it supports many common GLMs.


*You might be thinking about this from the perspective of an artificial neural network, where early stopping is sometimes used to accomplish regularization, but that's an entirely unrelated problem (namely, that the optimization algorithm is prevented from reaching an optimum, so the model is forced to be less complex).