[Math] Understanding Regularization parameters in Machine Learning/Statistics

machine-learning, regression, regularization

Suppose I have the following degree-$k$ polynomial regression model, fit to a data set of size $n$ with inputs $x_i$ and outcomes $t_i$ for each point in the data set (the powers of $x_i$ play the role of a $k$-dimensional feature vector).

\begin{equation}
\sum\limits_{i=1}^n \bigl(t_i - (w_0 + w_1x_i + w_2x_i^2 + w_3x_i^3 + \ldots + w_kx_i^k)\bigr)^2 + \lambda\sum\limits_{j=1}^k w_j^2
\end{equation}

where the $w_j$ are the parameters. Now I wish to minimise this function. How exactly does $\lambda$ affect the $w_j$'s? I know that $\lambda$ can be used to avoid overfitting, but I don't see exactly how this is the case. I've seen plots where the value of $\lambda$ is varied: as $\lambda$ initially increases the "fit" gets better, but eventually the "fit" starts getting worse as $\lambda$ increases further. Is this always the case?
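For concreteness, here is a minimal numpy sketch of the kind of experiment behind those plots. The data, noise level and degree are made up for illustration, and the intercept $w_0$ is left unpenalized to match the penalty sum starting at $j=1$:

```python
# Fit a degree-k polynomial by ridge regression for several lambdas and
# watch both the weights and the held-out fit. Toy data: noisy sin(2*pi*x).
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 9
x = np.linspace(0.0, 1.0, n)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

X = np.vander(x, k + 1, increasing=True)   # columns: 1, x, x^2, ..., x^k
P = np.eye(k + 1)
P[0, 0] = 0.0  # leave the intercept w_0 unpenalized (the sum starts at j=1)

x_test = np.linspace(0.0, 1.0, 200)        # held-out grid for judging "fit"
X_test = np.vander(x_test, k + 1, increasing=True)
t_test = np.sin(2 * np.pi * x_test)        # noise-free ground truth

for lam in [0.0, 1e-8, 1e-4, 1e-1, 10.0]:
    # Closed-form minimizer of  sum_i (t_i - X_i w)^2 + lam * w' P w
    w = np.linalg.solve(X.T @ X + lam * P, X.T @ t)
    test_mse = np.mean((t_test - X_test @ w) ** 2)
    print(f"lambda={lam:8.1e}  ||w||={np.linalg.norm(w):12.2f}  test MSE={test_mse:.4f}")
```

Running this, $\|w\|$ shrinks monotonically as $\lambda$ grows, while the held-out error typically first improves and then degrades, which matches the plots I described.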

Thanks for all the help!

EDIT: Please also help with the following:

I am trying to understand the regularization parameter for Support Vector Machines. If our classifying line is given by $y = w \cdot x + b$, where $w$ is the weight vector, $x$ is the input vector and $b$ is the bias, and we have slack (error) terms $\epsilon_i$ for each input $x_i$, then we need to minimise a function of our weights and errors in order to maximise the margin while still separating our data points:
\begin{equation}
L(w, \epsilon) = w \cdot w + \lambda\sum\limits_{i=1}^R \epsilon_i
\end{equation}

What exactly does $\lambda$ do? My book says: "If $\lambda$ is small then we prize a large margin over a few errors, and vice versa." I can't seem to find a satisfying explanation of why this should be the case.
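Here is a small experiment illustrating the claim. I'm assuming scikit-learn's linear SVC, whose parameter C plays the role of $\lambda$ in the usual soft-margin objective $\frac{1}{2} w \cdot w + C \sum_i \epsilon_i$; the toy data is made up. With small C the slack terms are cheap, so the optimizer prefers a small $w \cdot w$, i.e. a wide margin, even at the cost of a few errors:

```python
# Train a linear soft-margin SVM for several values of C (the book's lambda)
# and report the resulting margin width and number of training errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.8, size=(50, 2)),
               rng.normal(+1.0, 0.8, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)           # geometric margin width
    errors = int(np.sum(clf.predict(X) != y))  # training misclassifications
    print(f"C={C:7.2f}  margin width={margin:6.3f}  training errors={errors}")
```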


Best Answer

At a high level you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of a model is often measured by the size of the model $w$ viewed as a vector. The overall loss function, as in your example above, consists of an error term and a regularization term that is weighted by $\lambda$, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called a penalty).

It is useful to think about what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error term, so the model is adjusted primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model improves and the model vector grows, the regularization term becomes a more significant part of the loss. Regularization prevents the model vector from growing arbitrarily in exchange for negligible reductions in the error. $\lambda$ just determines the relative importance of keeping the model simple versus reducing the training error.
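To make that gradient-descent picture concrete, here is a minimal sketch on made-up linear data; the only change regularization makes is the extra shrinkage term $2\lambda w$ in the gradient:

```python
# Gradient descent on  mean squared error + lam * ||w||^2.
# The penalty contributes 2 * lam * w to the gradient, shrinking w every step.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
t = X @ w_true + 0.1 * rng.standard_normal(100)

def fit(lam, lr=0.01, steps=3000):
    w = np.zeros(5)
    for _ in range(steps):
        grad_err = 2.0 * X.T @ (X @ w - t) / len(t)  # error-term gradient
        grad_reg = 2.0 * lam * w                     # penalty-term gradient
        w -= lr * (grad_err + grad_reg)
    return w

for lam in [0.0, 1.0, 10.0]:
    w = fit(lam)
    print(f"lambda={lam:5.1f}  ||w||={np.linalg.norm(w):6.3f}  w={np.round(w, 2)}")
```

As $\lambda$ grows, the fitted weights are pulled toward zero: larger reductions in the error term are needed to justify any growth in $\|w\|$.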

There are different types of regularization term in common use. The one you have, and the one most commonly used in SVMs, is $L^2$ regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is $L^1$ or lasso regularization, which has the form $\lambda\sum_i |w_i|$, i.e. it penalizes the sum of the absolute values of the model parameters. It favours concentrating the size of the model in only a few components, the opposite of $L^2$ regularization. Generally $L^2$ tends to be preferable for low-dimensional models, while lasso tends to work better for high-dimensional models like text classification, where it leads to sparse models, i.e. models with few non-zero parameters.
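Here is a quick sketch of that contrast, using scikit-learn's Ridge and Lasso on a made-up problem where only three of twenty features matter:

```python
# Compare L2 (ridge) and L1 (lasso) coefficients on the same data:
# lasso tends to zero out irrelevant features, ridge keeps all of them small.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 20))
w_true = np.zeros(20)
w_true[:3] = [5.0, -4.0, 3.0]  # only the first 3 features are relevant
t = X @ w_true + 0.1 * rng.standard_normal(80)

ridge = Ridge(alpha=1.0).fit(X, t)
lasso = Lasso(alpha=0.1).fit(X, t)
print("ridge non-zero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))
print("lasso non-zero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```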

There is also elastic net regularization, which is simply a weighted combination of $L^1$ and $L^2$ regularization. In that case the loss function has three terms: the error term and the two regularization terms, each with its own regularization parameter.
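A final sketch with scikit-learn's ElasticNet; note that it parametrizes the two penalties as an overall strength alpha and a mixing weight l1_ratio rather than as two separate $\lambda$'s, so the mapping in the comments below is approximate:

```python
# Elastic net: loss = error + lambda1 * sum|w_i| + lambda2 * sum w_i^2.
# sklearn exposes this as alpha (overall strength) and l1_ratio (L1 share),
# roughly lambda1 = alpha * l1_ratio and lambda2 = alpha * (1 - l1_ratio) / 2.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.standard_normal((80, 20))
w_true = np.zeros(20)
w_true[:3] = [5.0, -4.0, 3.0]
t = X @ w_true + 0.1 * rng.standard_normal(80)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, t)  # 50/50 L1-L2 mix
print("elastic net non-zero coefficients:",
      int(np.sum(np.abs(enet.coef_) > 1e-6)))
```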