Solved – Implementing ridge regression: Selecting an intelligent grid for $\lambda$


I'm implementing ridge regression in a Python/C module, and I've come across this "little" problem. The idea is that I want to sample the effective degrees of freedom more or less equally spaced (like the plot on page 65 of "The Elements of Statistical Learning"), i.e., sample:
$$\mathrm{df}(\lambda)=\sum_{i=1}^{p}\frac{d_i^2}{d_i^2+\lambda},$$
where $d_i^2$ are the eigenvalues of the matrix $X^TX$, from $\mathrm{df}(\lambda_{\max})\approx 0$ to $\mathrm{df}(\lambda_{\min})=p$. An easy way to set the first limit is to let $\lambda_{\max}=\sum_{i=1}^{p} d_i^2/c$ (assuming $\lambda_{\max} \gg d_i^2$, so that $\mathrm{df}(\lambda_{\max})\approx \sum_i d_i^2/\lambda_{\max} = c$), where $c$ is a small constant representing approximately the minimum degrees of freedom that you want to sample (e.g., $c=0.1$). The second limit is of course $\lambda_{\min}=0$.
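
For concreteness, here is a minimal R sketch of these endpoints (the vector d2 is a hypothetical stand-in for the eigenvalues of $X^TX$):

# Grid endpoints; d2 is a hypothetical stand-in for the eigenvalues of X'X.
d2 <- exp(rnorm(10))
df.lambda  <- function(lambda) sum(d2 / (d2 + lambda))
c.min      <- 0.1                 # approximate minimum df to sample
lambda.max <- sum(d2) / c.min     # df(lambda.max) ~ c.min since lambda.max >> d_i^2
lambda.min <- 0                   # df(lambda.min) = p
df.lambda(lambda.max)             # close to (slightly below) c.min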

As the title suggests, then, I need to sample $\lambda$ from $\lambda_{\min}$ to $\lambda_{\max}$ on some scale such that $\mathrm{df}(\lambda)$ is sampled (approximately), say, in $0.1$ intervals from $c$ to $p$…is there an easy way to do this? I thought of solving the equation $\mathrm{df}(\lambda)=y$ for each target $y$ using the Newton–Raphson method, but this will require too many iterations, especially when $p$ is large. Any suggestions?

Best Answer

This is a long answer. So, let's give a short-story version of it here.

  • There's no nice algebraic solution to this root-finding problem, so we need a numerical algorithm.
  • The function $\mathrm{df}(\lambda)$ has lots of nice properties. We can harness these to create a specialized version of Newton's method for this problem with guaranteed monotonic convergence to each root.
  • Even brain-dead R code, absent any attempts at optimization, can compute a grid of size 100 with $p = 100\,000$ in a few seconds. Carefully written C code would reduce this by at least 2–3 orders of magnitude.

Two schemes are given below to guarantee monotonic convergence. One of them uses the bounds derived in Property 2, which seem to save a Newton step or two on occasion.

Example: $p = 100\,000$ and a uniform grid for the degrees of freedom of size 100. The eigenvalues are Pareto-distributed, hence highly skewed. Below are tables of the number of Newton steps to find each root.

# Table of Newton iterations per root
# (top row: number of Newton steps; bottom row: number of roots).
# Without the lower-bound check:
  steps:  1  3  4  5  6 
  roots:  1 28 65  5  1 
# With the lower-bound check:
  steps:  1  2  3 
  roots:  1 14 85 

There won't be a closed-form solution for this, in general, but there is a lot of structure present which can be used to produce very effective and safe solutions using standard root-finding methods.

Before digging too deeply into things, let's collect some properties and consequences of the function $$\newcommand{\df}{\mathrm{df}} \df(\lambda) = \sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda} \>. $$

Property 0: $\df$ is a rational function of $\lambda$. (This is apparent from the definition.)
Consequence 0: No general algebraic solution will exist for finding the root of $\df(\lambda) - y = 0$. This is because there is an equivalent polynomial root-finding problem of degree $p$, and so unless $p$ is extremely small (i.e., less than five), no general solution will exist. So, we'll need a numerical method.
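
To see this explicitly, clearing denominators in $\df(\lambda) = y$ gives the equivalent polynomial equation $$ \sum_{i=1}^p d_i^2 \prod_{j \neq i} \bigl(d_j^2 + \lambda\bigr) \;=\; y \prod_{i=1}^p \bigl(d_i^2 + \lambda\bigr) \>, $$ which has degree $p$ in $\lambda$; by Abel–Ruffini, no general solution in radicals exists for polynomial equations of degree five or more.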

Property 1: The function $\df$ is convex and decreasing on $\lambda \geq 0$. (Take derivatives.)
Consequence 1(a): Newton's root-finding algorithm will behave very nicely in this situation. Let $y$ be the desired degrees of freedom and $\lambda_0$ the corresponding root, i.e., $y = \df(\lambda_0)$. In particular, if we start out with any initial value $\lambda_1 < \lambda_0$ (so, $\df(\lambda_1) > y$), then the sequence of Newton-step iterations $\lambda_1,\lambda_2,\ldots$ will converge monotonically to the unique solution $\lambda_0$.
Consequence 1(b): Furthermore, if we were to start out with $\lambda_1 > \lambda_0$, then the first step would yield $\lambda_2 \leq \lambda_0$, whence it will monotonically increase to the solution by the previous consequence (see caveat below). Intuitively, this last fact follows because if we start to the right of the root, the derivative is "too" shallow due to the convexity of $\df$, and so the first Newton step will take us somewhere to the left of the root. NB: Since $\df$ is not in general convex for negative $\lambda$, this provides a strong reason to prefer starting to the left of the desired root. Otherwise, we need to double-check that the Newton step hasn't resulted in a negative value for the estimated root, which may place us somewhere in a nonconvex portion of $\df$.
Consequence 1(c): Once we've found the root for some $y_1$ and are then searching for the root from some $y_2 < y_1$, using $\lambda_1$ such that $\df(\lambda_1) = y_1$ as our initial guess guarantees we start to the left of the second root. So, our convergence is guaranteed to be monotonic from there.
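
As a quick illustration of Consequence 1(a), here is a toy demonstration of the monotone convergence (the eigenvalues are hypothetical, and the helper functions mirror the dof and newton.step code given further below):

# Toy demonstration: Newton iterates increase monotonically to the root
# when started to the left of it. Hypothetical eigenvalues, for illustration.
d2 <- c(1, 4, 9, 16)                          # the d_i^2
dof.toy  <- function(lambda) sum(d2 / (d2 + lambda))
step.toy <- function(lambda, y) {
    dprime <- -sum(d2 / (d2 + lambda)^2)      # df'(lambda)
    lambda - (dof.toy(lambda) - y) / dprime   # one Newton step
}
y      <- 2        # target df; the root lies to the right of 0
lambda <- 0        # dof.toy(0) = 4 > y, so we start to the left of the root
for (k in 1:6) {
    lambda <- step.toy(lambda, y)
    cat(sprintf("iter %d: lambda = %.10f  df = %.10f\n", k, lambda, dof.toy(lambda)))
}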

Property 2: Reasonable bounds exist to give "safe" starting points. Using convexity arguments and Jensen's inequality, we have the following bounds $$ \frac{p}{1+ \frac{\lambda}{p}\sum d_i^{-2}} \leq \df(\lambda) \leq \frac{p \sum_i d_i^2}{\sum_i d_i^2 + p \lambda} \>. $$ Consequence 2: This tells us that the root $\lambda_0$ satisfying $\df(\lambda_0) = y$ obeys $$ \frac{1}{\frac{1}{p}\sum_i d_i^{-2}}\left(\frac{p - y}{y}\right) \leq \lambda_0 \leq \left(\frac{1}{p}\sum_i d_i^2\right) \left(\frac{p - y}{y}\right) \>. \tag{$\star$} $$ So, up to a common constant, we've sandwiched the root in between the harmonic and arithmetic means of the $d_i^2$.

This assumes that $d_i > 0$ for all $i$. If this is not the case, then the same bound holds by considering only the positive $d_i$ and replacing $p$ by the number of positive $d_i$. NB: Since $\df(0) = p$ assuming all $d_i > 0$, then $y \in (0,p]$, whence the bounds are always nontrivial (e.g., the lower bound is always nonnegative).
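
These bounds are cheap to evaluate, so they also serve as a sanity check; here is a small numerical sketch of $(\star)$ with hypothetical eigenvalues:

# Numerical check of the bounds in (star); hypothetical eigenvalues.
set.seed(1)
d2 <- exp(rnorm(50))                      # positive d_i^2
p  <- length(d2)
y  <- 10                                  # target df in (0, p]
lower <- (p - y) / y / mean(1 / d2)       # harmonic-mean lower bound
upper <- (p - y) / y * mean(d2)           # arithmetic-mean upper bound
f <- function(lambda) sum(d2 / (d2 + lambda)) - y
root <- uniroot(f, c(lower, upper))$root  # the bounds bracket the root
c(lower = lower, root = root, upper = upper)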

Here is a plot of a "typical" example of $\df(\lambda)$ with $p = 400$. We've superimposed a grid of size 10 for the degrees of freedom. These are the horizontal lines in the plot. The vertical green lines correspond to the lower bound in $(\star)$.

[Figure: example plot of $\df(\lambda)$ with the degrees-of-freedom grid and the lower bounds from $(\star)$.]

An algorithm and some example R code

A very efficient algorithm, given a grid of desired degrees of freedom $y_1, \ldots, y_n$ in $(0,p]$, is to sort them in decreasing order and then sequentially find the root of each, using the previous root as the starting point for the next. We can refine this further by checking whether each root exceeds the lower bound in $(\star)$ for the next root and, if not, starting the next iteration at that lower bound instead.

Here is some example code in R, with no attempts made to optimize it. As seen below, it is still quite fast even though R is (to put it politely) horrifyingly, awfully, terribly slow at loops.

# Newton's method for finding the lambda that solves dof(lambda, d) = y.

# dof(lambda) = sum_i d_i^2/(d_i^2 + lambda), computed as sum_i 1/(1 + lambda/d_i^2).
dof <- function(lambda, d) { sum(1/(1+lambda / (d[d>0])^2)) }
# Derivative: dof'(lambda) = -sum_i d_i^2/(d_i^2 + lambda)^2.
dof.prime <- function(lambda, d) { -sum(1/(d[d>0]+lambda / d[d>0])^2) }

# One Newton step for f(lambda) = dof(lambda, d) - y.
newton.step <- function(lambda, y, d)
{ lambda - (dof(lambda,d)-y)/dof.prime(lambda,d) }

# Full Newton iteration: finds the root of y = dof(lambda, d).
newton <- function(y, d, lambda = NA, tol=1e-10, smart.start=TRUE)
{
    # Start at (or above) the lower bound from (star); see Consequence 2.
    if( is.na(lambda) || smart.start )
        lambda <- max(ifelse(is.na(lambda),0,lambda), (sum(d>0)/y-1)/mean(1/(d[d>0])^2))
    iter <- 0
    yn   <- Inf
    while( abs(y-yn) > tol )
    {
        lambda <- max(0, newton.step(lambda, y, d)) # max = pedantically safe
        yn   <- dof(lambda,d)
        iter <- iter + 1
    }
    return(list(lambda=lambda, dof=y, iter=iter, err=abs(y-yn)))
}

Below is the final full algorithm, which takes a grid of points and a vector of the $d_i$ (not $d_i^2$!).

newton.grid <- function(ygrid, d, lambda=NA, tol=1e-10, smart.start=TRUE)
{
    p <- sum(d>0)
    if( any(d < 0) || all(d==0) || any(ygrid > p) 
        || any(ygrid <= 0) || (!is.na(lambda) && lambda < 0) )
        stop("Don't try to fool me. That's not nice. Give me valid inputs, please.")
    ygrid  <- sort(ygrid, decreasing=TRUE) # largest df first, i.e., smallest lambda first
    out    <- data.frame()
    for(y in ygrid)
    {
        out    <- rbind(out, newton(y, d, lambda, tol=tol, smart.start=smart.start))
        lambda <- out$lambda[nrow(out)]    # warm-start the next root search
    }
    out
}

Sample function call

set.seed(17)
p <- 100000
# exp(Exponential) draws are Pareto-distributed, hence highly skewed.
d <- sqrt(sort(exp(rexp(p, 10)), decreasing=TRUE))
ygrid <- p*(1:100)/100
# Should take ten seconds or so.
out <- newton.grid(ygrid, d)
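
As a quick check (not part of the code above), the achieved degrees of freedom at each root should match their targets to within the tolerance:

# Verify that each root reproduces its requested df.
achieved <- sapply(out$lambda, dof, d = d)
max(abs(achieved - out$dof))   # should be below tol, i.e., 1e-10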