Let A be an $n \times p$ matrix of independent variables and B the corresponding $n \times 1$ vector of dependent values. In ridge regression, we introduce a parameter $\lambda$ so that: $\beta=(A^\mathrm{T}A+\lambda I)^{-1}A^\mathrm{T}B$ .
Now let [u s v]=svd(A) and let $d_{i}$ be the $i$th diagonal entry of 's'. We define the degrees of freedom (df) $= \sum_{i=1}^{p} \frac{d_{i}^2}{d_{i}^2+\lambda}$ . Ridge regression shrinks the coefficients of low-variance components, and hence the parameter $\lambda$ controls the degrees of freedom. So for $\lambda=0$, which is the case of ordinary regression, df = p, and all the independent variables are retained. The problem I am facing is to find the value of $\lambda$ given df and the matrix 's'. I have tried to rearrange the above equation but could not obtain a closed-form solution. Please provide any helpful pointers.
Solved – How to calculate regularization parameter in ridge regression given degrees of freedom and input matrix
ridge regression
Related Solutions
This is a long answer. So, let's give a short-story version of it here.
- There's no nice algebraic solution to this root-finding problem, so we need a numerical algorithm.
- The function $\mathrm{df}(\lambda)$ has lots of nice properties. We can harness these to create a specialized version of Newton's method for this problem with guaranteed monotonic convergence to each root.
- Even brain-dead R code, absent any attempts at optimization, can compute a grid of size 100 with $p = 100\,000$ in a few seconds. Carefully written C code would reduce this by at least 2–3 orders of magnitude.
There are two schemes given below to guarantee monotonic convergence. One uses bounds shown below, which seem to help save a Newton step or two on occasion.
Example: $p = 100\,000$ and a uniform grid for the degrees of freedom of size 100. The eigenvalues are Pareto-distributed, hence highly skewed. Below are tables of the number of Newton steps to find each root.
# Table of Newton iterations per root.
# Top row: number of Newton steps; bottom row: count of roots
# (out of 100) requiring that many steps.
# Without using the lower-bound check:
1 3 4 5 6
1 28 65 5 1
# With the lower-bound check:
1 2 3
1 14 85
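Tabulations like these can be generated from the output of the newton.grid function given below, e.g., via table(out$iter).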
There won't be a closed-form solution for this, in general, but there is a lot of structure present which can be used to produce very effective and safe solutions using standard root-finding methods.
Before digging too deeply into things, let's collect some properties and consequences of the function $$\newcommand{\df}{\mathrm{df}} \df(\lambda) = \sum_{i=1}^p \frac{d_i^2}{d_i^2 + \lambda} \>. $$
Property 0: $\df$ is a rational function of $\lambda$. (This is apparent from the definition.)
Consequence 0: No general algebraic solution will exist for finding the root $\df(\lambda) - y = 0$. This is because there is an equivalent polynomial root-finding problem of degree $p$ and so if $p$ is not extremely small (i.e., less than five), no general solution will exist. So, we'll need a numerical method.
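To see the equivalence, multiply $\df(\lambda) - y = 0$ through by $\prod_{j=1}^p (d_j^2 + \lambda)$ to clear denominators, which yields the polynomial equation $$ \sum_{i=1}^p d_i^2 \prod_{j \neq i} (d_j^2 + \lambda) \,-\, y \prod_{j=1}^p (d_j^2 + \lambda) = 0 \>, $$ of degree $p$ in $\lambda$; the Abel–Ruffini theorem then rules out a general solution in radicals for $p \geq 5$.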
Property 1: The function $\df$ is convex and decreasing on $\lambda \geq 0$. (Take derivatives.)
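Explicitly, for $\lambda \geq 0$, $$ \df'(\lambda) = -\sum_{i=1}^p \frac{d_i^2}{(d_i^2+\lambda)^2} < 0 \quad\text{and}\quad \df''(\lambda) = \sum_{i=1}^p \frac{2\,d_i^2}{(d_i^2+\lambda)^3} > 0 \>, $$ which gives the claimed monotonicity and convexity (assuming at least one $d_i > 0$).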
Consequence 1(a): Newton's root-finding algorithm will behave very nicely in this situation. Let $y$ be the desired degrees of freedom and $\lambda_0$ the corresponding root, i.e., $y = \df(\lambda_0)$. In particular, if we start out with any initial value $\lambda_1 < \lambda_0$ (so, $\df(\lambda_1) > y$), then the sequence of Newton-step iterations $\lambda_1,\lambda_2,\ldots$ will converge monotonically to the unique solution $\lambda_0$.
Consequence 1(b): Furthermore, if we were to start out with $\lambda_1 > \lambda_0$, then the first step would yield $\lambda_2 \leq \lambda_0$, whence it will monotonically increase to the solution by the previous consequence (see caveat below). Intuitively, this last fact follows because if we start to the right of the root, the derivative is "too" shallow due to the convexity of $\df$, and so the first Newton step will take us somewhere to the left of the root. NB Since $\df$ is not in general convex for negative $\lambda$, this provides a strong reason to prefer starting to the left of the desired root. Otherwise, we need to double-check that the Newton step hasn't resulted in a negative value for the estimated root, which may place us somewhere in a nonconvex portion of $\df$.
Consequence 1(c): Once we've found the root for some $y_1$ and are then searching for the root from some $y_2 < y_1$, using $\lambda_1$ such that $\df(\lambda_1) = y_1$ as our initial guess guarantees we start to the left of the second root. So, our convergence is guaranteed to be monotonic from there.
Property 2: Reasonable bounds exist to give "safe" starting points. Using convexity arguments and Jensen's inequality, we have the following bounds $$ \frac{p}{1+ \frac{\lambda}{p}\sum d_i^{-2}} \leq \df(\lambda) \leq \frac{p \sum_i d_i^2}{\sum_i d_i^2 + p \lambda} \>. $$ Consequence 2: This tells us that the root $\lambda_0$ satisfying $\df(\lambda_0) = y$ obeys $$ \frac{1}{\frac{1}{p}\sum_i d_i^{-2}}\left(\frac{p - y}{y}\right) \leq \lambda_0 \leq \left(\frac{1}{p}\sum_i d_i^2\right) \left(\frac{p - y}{y}\right) \>. \tag{$\star$} $$ So, up to a common constant, we've sandwiched the root in between the harmonic and arithmetic means of the $d_i^2$.
This assumes that $d_i > 0$ for all $i$. If this is not the case, then the same bound holds by considering only the positive $d_i$ and replacing $p$ by the number of positive $d_i$. NB: Since $\df(0) = p$ assuming all $d_i > 0$, then $y \in (0,p]$, whence the bounds are always nontrivial (e.g., the lower bound is always nonnegative).
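For reference, here is a small helper that computes the interval in $(\star)$; the name lambda.bounds is mine, and this is only a convenience sketch:
# Bounds (star) on the root lambda_0 for a target degrees of freedom y,
# using only the positive singular values in d.
lambda.bounds <- function(y, d) {
  d2 <- (d[d > 0])^2
  p  <- length(d2)
  c(lower = (p - y) / (y * mean(1 / d2)),  # harmonic-mean bound
    upper = (p - y) / y * mean(d2))        # arithmetic-mean bound
}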
Here is a plot of a "typical" example of $\df(\lambda)$ with $p = 400$. We've superimposed a grid of size 10 for the degrees of freedom. These are the horizontal lines in the plot. The vertical green lines correspond to the lower bound in $(\star)$.
An algorithm and some example R code
A very efficient algorithm given a grid of desired degrees of freedom $y_1, \ldots y_n$ in $(0,p]$ is to sort them in decreasing order and then sequentially find the root of each, using the previous root as the starting point for the following one. We can refine this further by checking if each root is greater than the lower bound for the next root, and, if not, we can start the next iteration at the lower bound instead.
Here is some example code in R, with no attempt made to optimize it. As seen below, it is still quite fast even though R is—to put it politely—horrifyingly, awfully, terribly slow at loops.
# Newton's step for finding solutions to the regularization dof.
dof       <- function(lambda, d) { sum(1 / (1 + lambda / (d[d > 0])^2)) }
dof.prime <- function(lambda, d) { -sum(1 / (d[d > 0] + lambda / d[d > 0])^2) }
newton.step <- function(lambda, y, d) {
  lambda - (dof(lambda, d) - y) / dof.prime(lambda, d)
}
# Full Newton iteration; finds the root of y = dof(lambda, d).
newton <- function(y, d, lambda = NA, tol = 1e-10, smart.start = TRUE)
{
  if (is.na(lambda) || smart.start)  # start at (or above) the lower bound in (star)
    lambda <- max(ifelse(is.na(lambda), 0, lambda),
                  (sum(d > 0) / y - 1) / mean(1 / (d[d > 0])^2))
  iter <- 0
  yn   <- Inf
  while (abs(y - yn) > tol)
  {
    lambda <- max(0, newton.step(lambda, y, d))  # max = pedantically safe
    yn     <- dof(lambda, d)
    iter   <- iter + 1
  }
  return(list(lambda = lambda, dof = y, iter = iter, err = abs(y - yn)))
}
Below is the final full algorithm which takes a grid of points, and a vector of the $d_i$ (not $d_i^2$!).
newton.grid <- function(ygrid, d, lambda = NA, tol = 1e-10, smart.start = TRUE)
{
  p <- sum(d > 0)
  if (any(d < 0) || all(d == 0) || any(ygrid > p)
      || any(ygrid <= 0) || (!is.na(lambda) && lambda < 0))
    stop("Don't try to fool me. That's not nice. Give me valid inputs, please.")
  ygrid <- sort(ygrid, decreasing = TRUE)
  out <- data.frame()
  for (y in ygrid)
  {
    # Use the previous root (or the user-supplied lambda) as the next start.
    out    <- rbind(out, newton(y, d, lambda, tol = tol, smart.start = smart.start))
    lambda <- out$lambda[nrow(out)]
  }
  out
}
Sample function call
set.seed(17)
p <- 100000
d <- sqrt(sort(exp(rexp(p, 10)), decreasing = TRUE))
ygrid <- p * (1:100) / 100
# Should take ten seconds or so.
out <- newton.grid(ygrid, d)
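As a quick sanity check (not part of the original run), the recovered values of $\lambda$ should reproduce the requested grid:
# Each recovered lambda should hit its requested dof value to within tol.
max(abs(sapply(out$lambda, dof, d = d) - out$dof))  # expect ~1e-10 or smaller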
I think there is no single answer to your question - it depends on the situation, the data, and what you are trying to do. Some of the methods below can or should be modified to achieve your goal. However, the following general discussion may help.
Before jumping into the more advanced methods, let's discuss the basic model first: Least Squares (LS) regression. There are two reasons why a least squares estimate of the parameters in the full model is often unsatisfactory:
Prediction quality: Least squares estimates often have small bias but high variance. The prediction quality can sometimes be improved by shrinking the regression coefficients or by setting some coefficients equal to zero. This increases the bias, but the variance of the prediction is reduced significantly, which leads to an overall improved prediction. This tradeoff between bias and variance is easily seen by decomposing the mean squared error (MSE), as shown after the next point. A smaller MSE leads to better prediction of new values.
Interpretability: If many predictor variables are available, it makes sense to identify the ones that have the largest influence, and to set to zero the ones that are not relevant for the prediction. Thus we eliminate variables that only explain some details, but keep those that allow for the major explanation of the response variable.
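The decomposition referred to in the first point is the standard one: for an estimator $\hat{\theta}$ of $\theta$, $$ \mathrm{MSE}(\hat{\theta}) = \mathrm{E}\big[(\hat{\theta}-\theta)^2\big] = \big[\mathrm{Bias}(\hat{\theta})\big]^2 + \mathrm{Var}(\hat{\theta}) \>, $$ so accepting a little bias can pay off if it buys a large reduction in variance.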
Thus variable selection methods come into play. With variable selection, only a subset of all input variables is used; the rest are eliminated from the model. Best subset regression finds the subset of size $k$, for each $k \in \{0, 1, \ldots, p\}$, that gives the smallest RSS. An efficient algorithm is the so-called leaps and bounds algorithm, which can handle up to $30$ or $40$ regressor variables. With data sets larger than $40$ input variables, a search through all possible subsets becomes infeasible, so forward stepwise selection and backward stepwise selection are useful. Backward selection can only be used when $n > p$ in order to have a well-defined model. The computational efficiency of these methods is questionable when $p$ is very high.
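As a rough illustration of best subset selection in R (assuming the leaps package is installed; mydata is a hypothetical data frame with response y):
# Hedged sketch: branch-and-bound best-subset search via the 'leaps' package.
library(leaps)
fit <- regsubsets(y ~ ., data = mydata, nvmax = 10)  # best subset of each size up to 10
summary(fit)$rss                                     # RSS of the best subset of each size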
In many situations we have a large number of inputs (as you do), often highly correlated (as in your case). In the case of highly correlated regressors, OLS leads to numerically unstable parameter estimates, i.e., unreliable $\beta$ estimates. To avoid this problem, we use methods based on derived input directions. These methods produce a small number of linear combinations $z_k, k = 1, 2, \ldots, q$ of the original inputs $x_j$, which are then used as inputs in the regression.
The methods differ in how the linear combinations are constructed. Principal components regression (PCR) looks for transformations of the original data into a new set of uncorrelated variables called principal components.
Partial Least Squares (PLS) regression also constructs a set of linear combinations of the inputs for regression, but unlike principal components regression it uses $y$ in addition to $X$ for this construction. We assume that both $y$ and $X$ are centered. Instead of calculating the parameters $\beta$ in the linear model, we estimate the parameters $\gamma$ in the so-called latent variable model. We assume the new coefficients $\gamma$ are of dimension $q \le p$. PLS does a regression on a weighted version of $X$ which contains incomplete or partial information. Since PLS also uses $y$ to determine the PLS directions, this method is expected to have better prediction performance than, for instance, PCR. Unlike PCR, PLS looks for directions that have both high variance and large correlation with $y$.
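A minimal sketch of both approaches, assuming the pls package is installed and mydata is a hypothetical data frame with response y:
# PCR and PLS fits with cross-validation via the 'pls' package (assumed).
library(pls)
fit.pcr <- pcr(y ~ ., ncomp = 5, data = mydata, validation = "CV")   # unsupervised directions
fit.pls <- plsr(y ~ ., ncomp = 5, data = mydata, validation = "CV")  # directions also use y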
Shrinkage methods keep all variables in the model and assign different (continuous) weights. In this way we obtain a smoother procedure with a smaller variability. Ridge regression shrinks the coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares. Here $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage. The coefficients are shrunk towards zero (and towards each other).
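In symbols (with $y$ and $X$ centered, so the intercept can be dropped), the ridge coefficients minimize the penalized residual sum of squares $$ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\} \>. $$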
By penalizing the RSS we try to avoid that highly correlated regressors cancel each other. An especially large positive coefficient $\beta$ can be canceled by a similarly large negative coefficient $\beta$. By imposing a size constraint on the coefficients this phenomenon can be prevented.
It can be shown that PCR is very similar to ridge regression: both methods use the principal components of the input matrix $X$. Ridge regression shrinks the coefficients of the principal components, with the shrinkage depending on the corresponding eigenvalues; PCR completely discards the components corresponding to the smallest $p - q$ eigenvalues.
The lasso is a shrinkage method like ridge, but the L1 norm rather than the L2 norm is used in the constraint. The L1-norm loss function is also known as least absolute deviations (LAD) or least absolute errors (LAE): it minimizes the sum of the absolute differences between the target values and the estimated values. The L2-norm loss function is also known as least squares error (LSE): it minimizes the sum of the squared differences between the target values ($Y_i$) and the estimated values. The difference between the L1 and L2 penalties is that L2 is the sum of the squares of the weights, while L1 is the sum of their absolute values. The L1 penalty tends to produce sparse coefficients and has built-in feature selection. The L1-penalized problem does not have an analytical solution, but the L2-penalized one does, which allows the L2 solutions to be computed efficiently. The L2-penalized problem has a unique solution, while the L1-penalized problem may not.
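In the same notation as the ridge criterion above, the lasso replaces the L2 penalty with an L1 penalty: $$ \hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\} \>, $$ or equivalently, in constrained form, minimize $\|y - X\beta\|_2^2$ subject to $\sum_j |\beta_j| \leq s$.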
Lasso and ridge differ in their penalty term. The lasso solutions are nonlinear, and a quadratic programming algorithm is traditionally used to compute them. Because of the nature of the constraint, making $s$ sufficiently small will cause some of the coefficients to be exactly $0$; thus the lasso does a kind of continuous subset selection. Like the subset size in subset selection, or the penalty in ridge regression, $s$ should be adaptively chosen to minimize an estimate of expected prediction error.
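In practice one rarely solves the quadratic program directly; as a hedged sketch, the glmnet package (which uses coordinate descent) fits the whole lasso path and picks the penalty by cross-validation (x and y here are a hypothetical design matrix and response):
# Lasso via 'glmnet' (assumed installed): alpha = 1 is the lasso, alpha = 0 is ridge.
library(glmnet)
cvfit <- cv.glmnet(x, y, alpha = 1)   # cross-validated penalty path
coef(cvfit, s = "lambda.min")         # sparse coefficients at the best penalty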
When $p \gg N$, high variance and overfitting are a major concern. As a result, simple, highly regularized approaches often become the methods of choice.
Principal components analysis is an effective method for finding linear combinations of features that exhibit large variation in a dataset. But what we seek here are linear combinations with both high variance and significant correlation with the outcome. Hence we want to encourage principal component analysis to find linear combinations of features that have high correlation with the outcome - supervised principal components (see page 678, Algorithm 18.1, in the book Elements of Statistical Learning).
Partial least squares downweights noisy features but does not throw them away; as a result, a large number of noisy features can contaminate the predictions. Thresholded PLS can be viewed as a noisy version of supervised principal components, and hence we might not expect it to work as well in practice. Supervised principal components can yield lower test errors than thresholded PLS. However, it does not always produce a sparse model involving only a small number of features.
The lasso, on the other hand, produces a sparse model from the data. Ridge regression tends to give average performance across the board. I think the lasso is a good choice when $p$ is large. Supervised principal components can also work well.
Best Answer
A Newton-Raphson/Fisher-scoring/Taylor-series algorithm would be suited to this.
You have the equation to solve for $\lambda$ $$h(\lambda)=\sum_{i=1}^{p}\frac{d_{i}^{2}}{d_{i}^{2}+\lambda}-df=0$$ with derivative $$\frac{\partial h}{\partial \lambda}=-\sum_{i=1}^{p}\frac{d_{i}^{2}}{(d_{i}^{2}+\lambda)^{2}}$$ You then get: $$h(\lambda)\approx h(\lambda^{(0)})+(\lambda-\lambda^{(0)})\frac{\partial h}{\partial \lambda}|_{\lambda=\lambda^{(0)}}=0$$
re-arranging for $\lambda$ you get: $$\lambda=\lambda^{(0)}-\left[\frac{\partial h}{\partial \lambda}|_{\lambda=\lambda^{(0)}}\right]^{-1}h(\lambda^{(0)})$$ This sets up the iterative search. For initial starting values, assume $d^{2}_{i}=1$ in the summation, then you get $\lambda^{(0)}=\frac{p-df}{df}$.
$$\lambda^{(j+1)}=\lambda^{(j)}+\left[\sum_{i=1}^{p}\frac{d_{i}^{2}}{(d_{i}^{2}+\lambda^{(j)})^{2}}\right]^{-1}\left[\sum_{i=1}^{p}\frac{d_{i}^{2}}{d_{i}^{2}+\lambda^{(j)}}-df\right]$$
This "goes" in the right direction (increase $\lambda$ when summation is too big, decrease it when too small), and typically only takes a few iterations to solve. Further the function is monotonic (an increase/decrease in $\lambda$ will always decrease/increase the summation), so that it will converge uniquely (no local maxima).