Consider the following ridge regression problem: minimize the loss function $\sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \|w\|_2^2$ with respect to the weight vector $w$. Taking the derivative with respect to $w$, I get $\sum_{i=1}^n 2(y_i - w^T x_i)(-x_i) + 2\lambda w$; setting it to zero implies $w = \big(\sum_{i=1}^n (y_i - w^T x_i)\,x_i\big)/\lambda$. Is this wrong? I know that the solution is $(X^TX + \lambda I)^{-1}X^Ty$.
Solved – Why are solutions to ridge regression always expressed using matrix notation
regression, ridge regression
Related Solutions
It suffices to modify the loss function by adding the penalty. In matrix terms, the initial quadratic loss function becomes $$ (Y - X\beta)^{T}(Y-X\beta) + \lambda \beta^T\beta.$$ Differentiating with respect to $\beta$ leads to the normal equation $$ X^{T}Y = \left(X^{T}X + \lambda I\right)\beta, $$ which yields the ridge estimator $\hat\beta = \left(X^{T}X + \lambda I\right)^{-1}X^{T}Y$.
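Here's a minimal numerical sketch of that estimator in numpy; the data, true coefficients, and $\lambda$ below are all made up for illustration:

```python
import numpy as np

# Illustrative data: 50 observations, 3 predictors (not from the post).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lam = 0.1  # an arbitrary penalty for the sketch

p = X.shape[1]
# Solve (X^T X + lambda I) beta = X^T y; solving the linear system is
# preferable to forming the explicit inverse.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```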
Since you ask for insights, I'm going to take a fairly intuitive approach rather than a more mathematical tack:
Following the concepts in my answer here, we can formulate a ridge regression as a regression with dummy data by adding $p$ (in your formulation) observations, where $y_{n+j}=0$, $x_{j,n+j}=\sqrt{\lambda}$ and $x_{i,n+j}=0$ for $i\neq j$. If you write out the new RSS for this expanded data set, you'll see the additional observations each add a term of the form $(0-\sqrt{\lambda}\beta_j)^2=\lambda\beta_j^2$, so the new RSS is the original $\text{RSS} + \lambda \sum_{j=1}^p\beta_j^2$ -- and minimizing the RSS on this new, expanded data set is the same as minimizing the ridge regression criterion.
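A quick numerical check of this equivalence, on made-up data (the names `X_aug` and `y_aug` are just for the sketch):

```python
import numpy as np

# Illustrative data.
rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 2.0

# Append p dummy rows: sqrt(lambda) * I, with zero responses.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Plain least squares on the augmented data...
beta_ols_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
# ...matches the closed-form ridge solution.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_ols_aug, beta_ridge))  # True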
So what does this construction show? As $\lambda$ increases, the additional $x$-rows each have one component that grows, and so the influence of these points also increases. They pull the fitted hyperplane toward themselves. Then as $\lambda$ and the corresponding components of the $x$'s go off to infinity, all the involved coefficients "flatten out" to $0$.
That is, as $\lambda\to\infty$, the penalty will dominate the minimization, so the $\beta$s will go to zero. If the intercept is not penalized (the usual case) then the model shrinks more and more toward the mean of the response.
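A tiny sketch of this shrinkage, again on made-up data, just printing the coefficients for a few increasing penalties:

```python
import numpy as np

# Illustrative data: 30 observations, 2 predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)

# Coefficients shrink toward zero as lambda grows.
for lam in [0.0, 1.0, 10.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    print(lam, np.round(beta, 4))
```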
I'll give an intuitive sense of why we're talking about ridges first (which also suggests why ridge regression is needed), then tackle a little history. The first part is adapted from my answer here:
If there's multicollinearity, you get a "ridge" in the likelihood function (the likelihood is a function of the $\beta$'s). This in turn yields a long "valley" in the RSS, since for Gaussian errors the RSS equals $-2\log\mathcal{L}$ up to constants.
Ridge regression "fixes" the ridge: it adds a penalty that turns the ridge into a nice peak in likelihood space, equivalently a nice depression in the criterion we're minimizing.
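You can see the ill-conditioning, and the fix, numerically. This sketch builds two nearly collinear columns of made-up data and compares the condition number of $X^TX$ with and without the added diagonal:

```python
import numpy as np

# Two nearly collinear predictors (illustrative data).
rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + 1e-6 * rng.normal(size=100)   # almost a copy of x1
X = np.column_stack([x1, x2])

lam = 1.0
print(np.linalg.cond(X.T @ X))                    # huge: a long ridge/valley
print(np.linalg.cond(X.T @ X + lam * np.eye(2)))  # modest: a well-defined minimum
```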
The actual story behind the name is a little more complicated. In 1959, A.E. Hoerl [1] introduced ridge analysis for response surface methodology, and it was very soon adapted [2] to dealing with multicollinearity in regression ('ridge regression'). See, for example, the discussion by R.W. Hoerl in [3], which describes Hoerl's (A.E., not R.W.) use of contour plots of the response surface* to identify where to head to find local optima (one 'heads up the ridge'). In ill-conditioned problems a very long ridge arises, and the insights and methodology from ridge analysis were adapted to the related issue with the likelihood/RSS in regression, producing ridge regression.
* examples of response surface contour plots (in the case of quadratic response) can be seen here (Fig 3.9-3.12).
That is, "ridge" actually refers to the characteristics of the function we were attempting to optimize, rather than to adding a "ridge" (a positive diagonal) to the $X^TX$ matrix (so while ridge regression does add to the diagonal, that's not why it's called 'ridge' regression).
For some additional information on the need for ridge regression, see the discussion of multicollinearity above.
References:
[1]: Hoerl, A.E. (1959). Optimum solution of many variables equations. Chemical Engineering Progress, 55 (11), 69-78.
[2]: Hoerl, A.E. (1962). Applications of ridge analysis to regression problems. Chemical Engineering Progress, 58 (3), 54-59.
[3]: Hoerl, R.W. (1985). Ridge Analysis 25 Years Later. The American Statistician, 39 (3), 186-192.
Best Answer
Your derivative is okay. Just remember to put all the $w$-terms on the same side of the equation $$\sum_i y_i x_i = \lambda w + \sum_i x_i x_i^T w.$$ Then pull $w$ out of the summation, since it's independent of $i$: $$\sum_i y_i x_i = \Big(\lambda I + \sum_i x_ix_i^T\Big)w.$$ At this point, dispose of the summations in favor of matrix notation $$X^Ty = \big(\lambda I + X^TX\big)w,$$ where $x_i^T$ is the $i^{th}$ row of $X$ (so $x_i$ is a column vector) and $y_i$ is the $i^{th}$ component of $y$. Solving gives $w = \big(X^TX + \lambda I\big)^{-1}X^Ty$, the closed form you quoted.
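A quick sketch checking that the summation and matrix forms agree, on illustrative data:

```python
import numpy as np

# Illustrative data.
rng = np.random.default_rng(4)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

# sum_i x_i x_i^T == X^T X   and   sum_i y_i x_i == X^T y
A_sum = sum(np.outer(X[i], X[i]) for i in range(n)) + lam * np.eye(p)
b_sum = sum(y[i] * X[i] for i in range(n))

w_sum = np.linalg.solve(A_sum, b_sum)
w_matrix = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(w_sum, w_matrix))  # True
```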