Solved – Should the lambda of ridge regression be related to the number of data points

regression, regularization, ridge regression

Suppose we have data $(x_1, y_1), \ldots, (x_N, y_N)$. The loss function of ridge regression is
$$
\sum_{i=1}^N (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2
$$

Notice that the $\sum_{i=1}^N (y_i - x_i^T \beta)^2$ term depends on the number of $(x_i, y_i)$ pairs: it grows with the length of the $y$ vector. However, $\sum_{j=1}^p \beta_j^2$ does not.

What about $\lambda$? Should it be related to $N$? It seems that if we keep $\lambda$ fixed, the penalty becomes relatively weaker as more data points are added. Am I wrong?
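
To make this concrete, here is a minimal numpy sketch with simulated (hypothetical) data: for a fixed $\beta$ and $\lambda$, the data term grows roughly linearly in $N$ while the penalty term stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam = 5, 1.0
beta = rng.normal(size=p)

for N in (100, 10_000):
    X = rng.normal(size=(N, p))
    y = X @ beta + rng.normal(size=N)
    data_term = np.sum((y - X @ beta) ** 2)  # sum of squared residuals, scales with N
    penalty = lam * np.sum(beta ** 2)        # does not depend on N
    print(f"N={N}: data term = {data_term:.1f}, penalty = {penalty:.3f}")
```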

Best Answer

There are many methods for selecting the regularization parameter in ridge regression (also known as Tikhonov regularization), including the L-curve method, the discrepancy principle, and generalized cross validation (GCV). Assuming that the least squares problem is ill-conditioned enough to require regularization, all of the commonly used methods that I've just mentioned produce a value of $\lambda$ that depends not only on the number of data points but also on the particular values of $y$. With any of these methods, the selected value of $\lambda$ will typically increase with the number of data points.
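
As an illustration of one of these methods, here is a minimal sketch of generalized cross validation for ridge regression, computed efficiently via the SVD of $X$. The function name and the grid of candidate values are my own; this is a sketch of the standard GCV criterion, not a library routine.

```python
import numpy as np

def gcv_ridge(X, y, lambdas):
    """Select lambda by generalized cross validation:
    GCV(lam) = N * ||y - H y||^2 / (N - tr(H))^2,
    where H = X (X^T X + lam I)^{-1} X^T is the ridge hat matrix."""
    N = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    best_lam, best_score = None, np.inf
    for lam in lambdas:
        shrink = s**2 / (s**2 + lam)  # eigenvalues of H in the SVD basis
        # ||y - H y||^2 split into the column space of X and its complement
        resid_sq = np.sum(((1 - shrink) * Uty) ** 2) + (y @ y - Uty @ Uty)
        edf = shrink.sum()            # tr(H), the effective degrees of freedom
        score = N * resid_sq / (N - edf) ** 2
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam
```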

In machine learning applications, it's typical to adjust $\lambda$ so that the prediction error over a validation set is minimized. Here, the choice of $\lambda$ depends on both the training data and the validation data. As the size of the training set grows, the optimal $\lambda$ typically grows as well.
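
A minimal sketch of that procedure, assuming the data have already been split into training and validation sets and using an arbitrary grid of candidate values (the function names are mine, not from any particular library):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for the penalty lam * ||beta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def select_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Return the candidate lambda with the smallest validation MSE."""
    val_mse = [np.mean((y_val - X_val @ ridge_fit(X_train, y_train, lam)) ** 2)
               for lam in lambdas]
    return lambdas[int(np.argmin(val_mse))]
```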

If you think of ridge regression as finding the Bayesian maximum a posteriori (MAP) solution, starting from a multivariate normal prior on the parameters $\beta$ and assuming a multivariate normal likelihood, then estimating the MAP solution leads exactly to the ridge regression least squares problem. In this framework, since $\lambda$ is determined by the prior covariance (and the noise variance), it should not change as you add more data points.
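
To spell out why, a short derivation under the stated assumptions: with likelihood $y \mid \beta \sim N(X\beta, \sigma^2 I)$ and prior $\beta \sim N(0, \tau^2 I)$, minimizing the negative log posterior is equivalent to minimizing
$$
\frac{1}{2\sigma^2} \|y - X\beta\|^2 + \frac{1}{2\tau^2} \|\beta\|^2
\;\propto\;
\|y - X\beta\|^2 + \frac{\sigma^2}{\tau^2} \|\beta\|^2,
$$
so the implied $\lambda = \sigma^2/\tau^2$ is fixed by the noise and prior variances and does not involve $N$.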