Solved – How to calculate regularization parameter in ridge regression given degrees of freedom and input matrix

ridge regression

Let $A$ be an $n \times p$ matrix of independent variables and $B$ the corresponding $n \times 1$ vector of dependent values. In ridge regression, we introduce a parameter $\lambda$ so that $\beta=(A^\mathrm{T}A+\lambda I)^{-1}A^\mathrm{T}B$.
Now let $[u\,s\,v]=\mathrm{svd}(A)$ and let $d_{i}$ be the $i^{th}$ diagonal entry of $s$. We define the degrees of freedom as $\mathrm{df}=\sum_{i=1}^{p} \frac{d_{i}^2}{d_{i}^2+\lambda}$. Ridge regression shrinks the coefficients of low-variance components, and hence the parameter $\lambda$ controls the degrees of freedom. So for $\lambda=0$, which is the case of ordinary regression, $\mathrm{df}=p$, and all the independent variables are retained at full weight. The problem I am facing is to find the value of $\lambda$ given df and the matrix $s$. I have tried to re-arrange the above equation but could not obtain a closed-form solution. Please provide any helpful pointers.
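For concreteness, df can be computed from the singular values roughly like this in MATLAB/Octave (the dimensions, design matrix, and value of $\lambda$ below are made up for illustration):

```matlab
n = 50; p = 5;                      % assumed dimensions, for illustration only
A = randn(n, p);                    % assumed design matrix
d = svd(A);                         % vector of singular values d_1, ..., d_p
lambda = 2.0;                       % an arbitrary example value of lambda
df = sum(d.^2 ./ (d.^2 + lambda));  % effective degrees of freedom
```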

Best Answer

A Newton-Raphson/Fisher-scoring/Taylor-series algorithm is well suited to this.

You need to solve the following equation for $\lambda$: $$h(\lambda)=\sum_{i=1}^{p}\frac{d_{i}^{2}}{d_{i}^{2}+\lambda}-df=0$$ with derivative $$\frac{\partial h}{\partial \lambda}=-\sum_{i=1}^{p}\frac{d_{i}^{2}}{(d_{i}^{2}+\lambda)^{2}}$$ Expanding $h$ to first order about a current guess $\lambda^{(0)}$ gives: $$h(\lambda)\approx h(\lambda^{(0)})+(\lambda-\lambda^{(0)})\frac{\partial h}{\partial \lambda}\Big|_{\lambda=\lambda^{(0)}}=0$$
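In code, $h$ and its derivative can be expressed as anonymous functions, assuming the singular values `d` and the target `df` are already in the workspace (the names `h` and `dh` are just illustrative):

```matlab
% h(lambda) and its derivative, given singular values d and a target df
h  = @(lam) sum(d.^2 ./ (d.^2 + lam)) - df;
dh = @(lam) -sum(d.^2 ./ (d.^2 + lam).^2);
```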

Re-arranging for $\lambda$ you get: $$\lambda=\lambda^{(0)}-\left[\frac{\partial h}{\partial \lambda}\Big|_{\lambda=\lambda^{(0)}}\right]^{-1}h(\lambda^{(0)})$$ This sets up the iterative search. For an initial starting value, assume $d^{2}_{i}=1$ in the summation; then you get $\lambda^{(0)}=\frac{p-df}{df}$.

The Newton update is then: $$\lambda^{(j+1)}=\lambda^{(j)}+\left[\sum_{i=1}^{p}\frac{d_{i}^{2}}{(d_{i}^{2}+\lambda^{(j)})^{2}}\right]^{-1}\left[\sum_{i=1}^{p}\frac{d_{i}^{2}}{d_{i}^{2}+\lambda^{(j)}}-df\right]$$

This "goes" in the right direction (increase $\lambda$ when summation is too big, decrease it when too small), and typically only takes a few iterations to solve. Further the function is monotonic (an increase/decrease in $\lambda$ will always decrease/increase the summation), so that it will converge uniquely (no local maxima).