Ridge regression and function smoothness with L2 regularization

Tags: regression, regularization, ridge regression

Ridge regression's objective function:
$$
L(w) = \underbrace{\|y - Xw\|^2}_\text{data term} + \underbrace{\lambda\|w\|^2}_\text{smoothness term}
$$
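
For reference, this objective has the well-known closed-form minimizer $w = (X^\top X + \lambda I)^{-1}X^\top y$. A minimal NumPy sketch (the data here is synthetic, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # synthetic design matrix
w_true = np.array([0.5, 0.7, 0.3])
y = X @ w_true + rng.normal(scale=0.1, size=100)   # noisy targets

lam = 1.0  # regularization strength lambda

# Closed-form ridge solution: w = (X^T X + lambda * I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)
```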

I am trying to understand the regularization term, $\lambda\|w\|^2$. My questions are:

  1. What does smoothness mean here?

    I checked the definition of "smooth" on Wolfram MathWorld, but it does not seem to fit here:

    A smooth function is a function that has continuous derivatives up to some desired order over some domain.

  2. I read a document explaining the smoothness term (page 12 of the PDF):

    A very common assumption is that the underlying function is likely to be smooth, for example, having small derivatives. Smoothness distinguishes the examples in Figure 2. There is also a practical reason to prefer smoothness, in that assuming smoothness reduces model complexity:

    I have difficulty understanding the two claims above:

    • why a smooth underlying function will have small derivatives

    • why smoothness reduces model complexity

My counterexample is:
$$
f(x) = w_0 + w_1x + w_2x^2 + w_3x^3
$$

With $w = [0.5, 0.7, 0.3, 0.4]$ or $w = [5, 7, 3, 4]$, both functions are $C^\infty$, so both are equally "smooth" by the definition above.

I know I must be making a mistake somewhere. Please help me understand this correctly. Thank you.

Best Answer

As @Michael Chernick said, smoothness is a bad term. I can see it making sense if you are fitting a scatterplot smoother and want to limit the second derivatives, but here it's really a shrinkage parameter ($\lambda$, that is).
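
To connect this to the counterexample in the question: both coefficient vectors do give $C^\infty$ functions, but "small derivatives" in the quoted passage refers to the *magnitude* of the derivatives, not their existence or continuity. Differentiating the cubic gives
$$
f'(x) = w_1 + 2w_2x + 3w_3x^2,
$$
so scaling $w$ up by a factor of 10 scales $f'(x)$ (and every higher derivative) by 10 as well. Penalizing $\|w\|^2$ therefore keeps the fitted function's slope and curvature small, which is the informal sense of "smooth" used in that document.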

It penalizes large coefficients. However, it does this smoothly, in the sense that it does not "zero out" any of your variables. This is different from the LASSO regularizer, $\lambda \|w\|_1$, which can zero out variables entirely.
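
As a quick illustration of that contrast, here is a minimal scikit-learn sketch (synthetic data; the penalty strengths are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features carry signal; the rest are pure noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 3))  # noise coefficients driven exactly to zero
```

The ridge coefficients for the noise features are small but nonzero, while the LASSO sets them exactly to zero, which is the "zeroing out" behavior described above.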
