This is a long answer. So, let's give a short-story version of it here.
- There's no nice algebraic solution to this root-finding problem, so we need a numerical algorithm.
- The function $\mathrm{df}(\lambda)$ has lots of nice properties. We can harness these to create a specialized version of Newton's method for this problem with guaranteed monotonic convergence to each root.
- Even brain-dead R code with no attempt at optimization can compute a grid of size 100 with $p = 100\,000$ in a few seconds. Carefully written C code would reduce this time by at least 2–3 orders of magnitude.
There are two schemes given below to guarantee monotonic convergence. One uses bounds shown below, which seem to help save a Newton step or two on occasion.
Example: $p = 100\,000$ and a uniform grid for the degrees of freedom of size 100. The eigenvalues are Pareto-distributed, hence highly skewed. Below are tables of the number of Newton steps to find each root.
# Number of Newton steps needed to find each root.
# Without the lower-bound check:
#   steps: 1  3  4  5  6
#   roots: 1 28 65  5  1
# With the lower-bound check:
#   steps: 1  2  3
#   roots: 1 14 85
There won't be a closed-form solution for this, in general, but there is a lot of structure present which can be used to produce very effective and safe solutions using standard root-finding methods.
Before digging too deeply into things, let's collect some properties and consequences of the function
$$\newcommand{\df}{\mathrm{df}}
\df(\lambda) = \sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda} \>.
$$
Property 0: $\df$ is a rational function of $\lambda$. (This is apparent from the definition.)
Consequence 0: No general algebraic solution will exist for finding the root of $\df(\lambda) - y = 0$. This is because there is an equivalent polynomial root-finding problem of degree $p$, so by the Abel–Ruffini theorem, unless $p$ is extremely small (less than five), no general closed-form solution exists. Hence, we need a numerical method.
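To see the equivalence explicitly, clearing denominators turns the root-finding problem into a polynomial one:
$$
\df(\lambda) - y = 0 \quad\Longleftrightarrow\quad \sum_{i=1}^p d_i^2 \prod_{j \neq i} (d_j^2 + \lambda) \;-\; y \prod_{j=1}^p (d_j^2 + \lambda) = 0 \>,
$$
where the left-hand side is a polynomial of degree $p$ in $\lambda$ (the second term contributes the leading monomial $-y\lambda^p$).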
Property 1: The function $\df$ is convex and decreasing on $\lambda \geq 0$. (Take derivatives.)
Consequence 1(a): Newton's root-finding algorithm will behave very nicely in this situation. Let $y$ be the desired degrees of freedom and $\lambda_0$ the corresponding root, i.e., $y = \df(\lambda_0)$. In particular, if we start out with any initial value $\lambda_1 < \lambda_0$ (so, $\df(\lambda_1) > y$), then the sequence of Newton-step iterations $\lambda_1,\lambda_2,\ldots$ will converge monotonically to the unique solution $\lambda_0$.
Consequence 1(b): Furthermore, if we were to start out with $\lambda_1 > \lambda_0$, then the first step would yield $\lambda_2 \leq \lambda_0$, whence the iterates will monotonically increase to the solution by the previous consequence (see caveat below). Intuitively, this last fact follows because if we start to the right of the root, the derivative is "too" shallow due to the convexity of $\df$, and so the first Newton step will take us somewhere to the left of the root. NB: Since $\df$ is not in general convex for negative $\lambda$, this provides a strong reason to prefer starting to the left of the desired root. Otherwise, we need to double-check that the Newton step hasn't resulted in a negative value for the estimated root, which may place us somewhere in a nonconvex portion of $\df$.
Consequence 1(c): Once we've found the root for some $y_1$ and are then searching for the root from some $y_2 < y_1$, using $\lambda_1$ such that $\df(\lambda_1) = y_1$ as our initial guess guarantees we start to the left of the second root. So, our convergence is guaranteed to be monotonic from there.
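As a quick self-contained illustration of the monotone behavior (the values of $d$ here are toy choices of mine, not from the question), the Newton iterates from a left-sided start increase monotonically toward the root:
# Toy illustration: Newton iterates increase monotonically from the left.
d2 <- c(1, 2, 5)^2                       # toy values of d_i^2
y  <- 1.5                                # target degrees of freedom
f  <- function(l) sum(d2/(d2 + l)) - y   # df(lambda) - y
fp <- function(l) -sum(d2/(d2 + l)^2)    # its derivative in lambda
l  <- 0                                  # df(0) = 3 > y, so we start left of the root
for (i in 1:6) { l <- l - f(l)/fp(l); cat(i, l, "\n") }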
Property 2: Reasonable bounds exist to give "safe" starting points. Using convexity arguments and Jensen's inequality, we have the following bounds
$$
\frac{p}{1+ \frac{\lambda}{p}\sum d_i^{-2}} \leq \df(\lambda) \leq \frac{p \sum_i d_i^2}{\sum_i d_i^2 + p \lambda} \>.
$$
Consequence 2: This tells us that the root $\lambda_0$ satisfying $\df(\lambda_0) = y$ obeys
$$
\frac{1}{\frac{1}{p}\sum_i d_i^{-2}}\left(\frac{p - y}{y}\right) \leq \lambda_0 \leq \left(\frac{1}{p}\sum_i d_i^2\right) \left(\frac{p - y}{y}\right) \>. \tag{$\star$}
$$
So, up to the common factor $(p - y)/y$, we've sandwiched the root between the harmonic and arithmetic means of the $d_i^2$.
This assumes that $d_i > 0$ for all $i$. If this is not the case, then the same bound holds by considering only the positive $d_i$ and replacing $p$ by the number of positive $d_i$. NB: Since $\df(0) = p$ assuming all $d_i > 0$, then $y \in (0,p]$, whence the bounds are always nontrivial (e.g., the lower bound is always nonnegative).
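A quick numeric sanity check of the sandwich bounds and of the bracket in $(\star)$ (simulated, illustrative values of $d$, not part of the original exposition):
# Spot-check the sandwich bounds on simulated d.
set.seed(1)
d <- rexp(1000); p <- length(d)
lambda <- 3
c(lower = p/(1 + lambda*mean(1/d^2)),
  df    = sum(d^2/(d^2 + lambda)),
  upper = p*sum(d^2)/(sum(d^2) + p*lambda))
# Bracket for the root at a target y; since df is decreasing,
# df(lo) >= y >= df(hi) confirms that lo <= lambda_0 <= hi.
y  <- p/2
lo <- (1/mean(1/d^2)) * (p - y)/y
hi <- mean(d^2) * (p - y)/y
sapply(c(lo, hi), function(l) sum(d^2/(d^2 + l)))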
Here is a plot of a "typical" example of $\df(\lambda)$ with $p = 400$. We've superimposed a grid of size 10 for the degrees of freedom. These are the horizontal lines in the plot. The vertical green lines correspond to the lower bound in $(\star)$.
An algorithm and some example R code
A very efficient algorithm given a grid of desired degrees of freedom $y_1, \ldots y_n$ in $(0,p]$ is to sort them in decreasing order and then sequentially find the root of each, using the previous root as the starting point for the following one. We can refine this further by checking if each root is greater than the lower bound for the next root, and, if not, we can start the next iteration at the lower bound instead.
Here is some example code in R, with no attempts made to optimize it. As seen below, it is still quite fast even though R is, to put it politely, horrifyingly slow at loops.
# Degrees-of-freedom function df(lambda) and its derivative, using only d > 0.
dof <- function(lambda, d) { sum(1/(1+lambda / (d[d>0])^2)) }
# Note 1/(d + lambda/d)^2 = d^2/(d^2+lambda)^2, the derivative of each summand.
dof.prime <- function(lambda, d) { -sum(1/(d[d>0]+lambda / d[d>0])^2) }
# One Newton step for the root of dof(lambda, d) = y.
newton.step <- function(lambda, y, d)
{ lambda - (dof(lambda,d)-y)/dof.prime(lambda,d) }
# Full Newton iteration: finds the root of dof(lambda, d) = y.
newton <- function(y, d, lambda = NA, tol = 1e-10, smart.start = TRUE)
{
    # Start at the larger of the supplied lambda and the lower bound in (star).
    if( is.na(lambda) || smart.start )
        lambda <- max(ifelse(is.na(lambda), 0, lambda),
                      (sum(d>0)/y - 1) / mean(1/(d[d>0])^2))
    iter <- 0
    yn <- Inf
    while( abs(y - yn) > tol )
    {
        lambda <- max(0, newton.step(lambda, y, d)) # max = pedantically safe
        yn <- dof(lambda, d)
        iter <- iter + 1
    }
    return(list(lambda = lambda, dof = y, iter = iter, err = abs(y - yn)))
}
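For a single root, the function can be called directly; the values of $d$ here are toy choices just to show the interface:
d <- sqrt(1:10)                # toy singular values
newton(5, d)                   # lambda with dof(lambda, d) = 5, plus the iteration count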
Below is the full algorithm, which takes a grid of points and a vector of the $d_i$ (not $d_i^2$!).
newton.grid <- function(ygrid, d, lambda = NA, tol = 1e-10, smart.start = TRUE)
{
    p <- sum(d > 0)
    if( any(d < 0) || all(d == 0) || any(ygrid > p)
        || any(ygrid <= 0) || (!is.na(lambda) && lambda < 0) )
        stop("Don't try to fool me. That's not nice. Give me valid inputs, please.")
    ygrid <- sort(ygrid, decreasing = TRUE)
    out <- data.frame()
    for(y in ygrid)
    {
        out <- rbind(out, newton(y, d, lambda, tol = tol, smart.start = smart.start))
        lambda <- out$lambda[nrow(out)]   # warm start: reuse the previous root
    }
    out
}
Sample function call
set.seed(17)
p <- 100000
d <- sqrt(sort(exp(rexp(p, 10)), decreasing = TRUE))
ygrid <- p*(1:100)/100
# Should take ten seconds or so.
out <- newton.grid(ygrid,d)
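The step-count tables at the top of this answer are just tabulations of the iter column; assuming that smart.start=FALSE corresponds to skipping the lower-bound check, they can be regenerated along these lines:
table(newton.grid(ygrid, d, smart.start = FALSE)$iter)  # without the lower-bound check
table(out$iter)                                         # with the lower-bound check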
Answering Question 1.
Chen & Chan, "Subset ARMA selection via the adaptive Lasso" (2011),* use a workaround to avoid the computationally demanding maximum likelihood estimation. Quoting the paper, they
propose to find an optimal subset ARMA model by fitting an adaptive Lasso regression of the time series $y_t$ on its own lags and those of the residuals that are obtained from fitting a long autoregression to the $y_t$s. <...> [U]nder mild regularity conditions, the proposed method achieves the oracle properties, namely, it identifies the correct subset ARMA model with probability tending to one as the sample size increases to infinity, and <...> the estimators of the nonzero coefficients are asymptotically normal with the limiting distribution the same as that when the zero coefficients are known a priori.
Optionally, they suggest maximum likelihood estimation and model diagnostics for the selected subset ARMA model(s).
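A minimal sketch of this lag-regression idea in R, assuming the glmnet package; the lag orders, the OLS-based adaptive weights, and the simulated data below are illustrative stand-ins of mine, not the authors' implementation:
library(glmnet)
set.seed(1)
n <- 500
y <- as.vector(arima.sim(n = n, list(ar = 0.6, ma = -0.4)))
h <- floor(10*log10(n))                                    # long-AR order (heuristic choice)
eps <- as.vector(ar(y, order.max = h, aic = FALSE)$resid)  # residuals as proxy errors
P <- Q <- 5                                                # candidate ARMA lag orders
t0 <- max(P, Q + h) + 1
idx <- t0:n
X  <- cbind(sapply(1:P, function(l) y[idx - l]),           # lags of y
            sapply(1:Q, function(m) eps[idx - m]))         # lags of the AR residuals
yy <- y[idx]
w  <- 1/abs(coef(lm(yy ~ X))[-1])                          # adaptive-lasso weights from OLS
fit <- cv.glmnet(X, yy, penalty.factor = w)
coef(fit, s = "lambda.min")                                # sparse subset-ARMA coefficients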
Wilms et al. "Sparse Identification and Estimation of High-Dimensional Vector AutoRegressive Moving Averages" (2017) do even more than I asked for. Instead of a univariate ARIMA model, they take a vector ARMA (VARMA) in high dimensions, and they use an $L_1$ penalty for estimation and lag order selection. They present the estimation algorithm and develop some asymptotic results.
In particular, they employ a two-stage procedure. Consider a VARMA model
$$
y_t = \sum_{l=1}^p \Phi_l y_{t-l} + \sum_{m=1}^q \Theta_m \varepsilon_{t-m} + \varepsilon_t
$$
which needs to be estimated, but the lag orders $p$ and $q$ are unknown.
In Stage 1, they approximate the VARMA model by a high-order VAR model and estimate it using a hierarchical VAR estimator, which places a lag-based hierarchical group-lasso penalty on the autoregressive parameters.
(The lag order is set to $\lfloor 1.5\sqrt{T} \rfloor$. The model equations are estimated jointly, and the squared Frobenius norm of the errors $\|y-\hat y\|_F^2$ is minimized with a hierarchical group-lasso penalty on the regression coefficients.)
They obtain residuals $\hat\varepsilon := y - \hat y$ to be used as proxies for the true errors in Stage 2.
In Stage 2, they estimate a VARX model where X represents lagged residuals from Stage 1. That is, they mimic a VARMA model but use the estimated residuals in place of the true errors, $$ y_t = \sum_{l=1}^{\hat p} \Phi_l y_{t-l} + \sum_{m=1}^{\hat q} \Theta_m \hat\varepsilon_{t-m} + u_t, $$ which allows applying the same estimator (hierarchical group-lasso) again, just as in Stage 1.
($\hat p$ and $\hat q$ are set to be $\lfloor 1.5\sqrt{T} \rfloor$.)
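Schematically, the two stages can be mimicked with a plain lasso standing in for their hierarchical group-lasso estimator; this is a rough sketch of mine on placeholder data, not the authors' code (see the bigtime package below for the real implementation):
library(glmnet)
set.seed(2)
n <- 400; k <- 2
Y <- matrix(rnorm(n*k), n, k)                    # placeholder bivariate series
for (t in 2:n) Y[t, ] <- 0.5*Y[t-1, ] + Y[t, ]
h <- floor(1.5*sqrt(n))                          # Stage-1 VAR order, as in the paper
Z  <- embed(Y, h + 1)                            # columns: Y_t, Y_{t-1}, ..., Y_{t-h}
X1 <- Z[, -(1:k)]; Y1 <- Z[, 1:k]
E  <- sapply(1:k, function(j)                    # Stage-1 residuals = error proxies
    as.vector(Y1[, j] - predict(cv.glmnet(X1, Y1[, j]), X1, s = "lambda.min")))
# Stage 2: sparse VARX of Y_t on lags of Y and lags of the Stage-1 residuals.
W  <- cbind(embed(Y1, h + 1)[, -(1:k)], embed(E, h + 1)[, -(1:k)])
Y2 <- embed(Y1, h + 1)[, 1:k]
fit <- lapply(1:k, function(j) cv.glmnet(W, Y2[, j]))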
The approach of Wilms et al. is implemented in the R package "bigtime".
References
- Chen, K., & Chan, K. S. (2011). Subset ARMA selection via the adaptive Lasso. Statistics and its Interface, 4(2), 197-205.
- Wilms, I., Basu, S., Bien, J., & Matteson, D. S. (2017). Sparse Identification and Estimation of High-Dimensional Vector AutoRegressive Moving Averages. arXiv preprint arXiv:1707.09208.
*Thanks to @hejseb for the link.
Best Answer
The effect of $\lambda$ in the ridge regression estimator is that it "inflates" the singular values $s_i$ of $X$ via terms like $(s^2_i+\lambda)/s_i$. Specifically, if the SVD of the design matrix is $X=USV^\top$, then $$\hat\beta_\mathrm{ridge} = V \frac{S}{S^2+\lambda I} U^\top y.$$ This is explained multiple times on our website; see e.g. @whuber's detailed exposition here: The proof of shrinking coefficients using ridge regression through "spectral decomposition".
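A quick numeric check of this identity (my own illustration, on random data):
set.seed(3)
X <- matrix(rnorm(50*4), 50, 4); y <- rnorm(50); lambda <- 2
b.direct <- solve(crossprod(X) + lambda*diag(4), crossprod(X, y))  # (X'X + lambda I)^{-1} X'y
s <- svd(X)
b.svd <- s$v %*% ((s$d/(s$d^2 + lambda)) * crossprod(s$u, y))      # V S/(S^2+lambda) U'y
all.equal(as.vector(b.direct), as.vector(b.svd))                   # TRUE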
This suggests that selecting $\lambda$ much larger than $s_\mathrm{max}^2$ will shrink everything very strongly. I suspect that $$\lambda=\|X\|_F^2=\sum s_i^2$$ will be too big for all practical purposes.
I usually normalize my lambdas by the squared Frobenius norm of $X$ and have a cross-validation grid that goes from $0$ to $1$ (on a log scale).
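For instance (reusing X from the check above, with an illustrative grid size and range), such a grid could look like:
lambdas <- sum(X^2) * 10^seq(-8, 0, length.out = 30)  # ||X||_F^2 times a log-scale grid up to 1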
Having said that, no value of lambda can be seen as truly "maximum", in contrast to the lasso case. Imagine that the predictors are exactly orthogonal to the response, i.e., that the true $\beta=0$. Any finite value of $\lambda$, for any finite sample size $n$, will yield $\hat\beta \ne 0$, and hence the fit could benefit from even stronger shrinkage.