Solved – Interpretation of ridge regularization in regression

Tags: pca, regression, regularization, ridge regression

I have several questions regarding the ridge penalty in the least squares context:

$$\beta_{ridge} = (\lambda I_D + X'X)^{-1}X'y$$
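
As a quick numerical check of this closed form, here is a minimal sketch on synthetic, standardized data (numpy, with scikit-learn's `Ridge` as a cross-check; the data and the choice $\lambda = 1$ are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data: 100 observations, 5 correlated predictors, standardized.
n, D = 100, 5
X = rng.standard_normal((n, D)) @ rng.standard_normal((D, D))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ rng.standard_normal(D) + rng.standard_normal(n)

lam = 1.0

# Closed-form ridge estimator: (lambda*I_D + X'X)^{-1} X'y
beta_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# scikit-learn minimizes ||y - X b||^2 + lambda*||b||^2 (intercept disabled
# so that the two objectives coincide); the coefficients should match.
sk = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, sk.coef_))  # expected: True
```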

1) The expression suggests that the covariance matrix of X is shrunk towards a diagonal matrix, meaning that (assuming the variables are standardized before the procedure) the correlations among the input variables are lowered. Is this interpretation correct?

2) If it is a shrinkage application, why is it not formulated along the lines of $(\lambda I_D + (1-\lambda)X'X)$, assuming that we can somehow restrict $\lambda$ to the $[0,1]$ range with a normalization?

3) What normalization for $\lambda$ would restrict it to a standard range such as $[0,1]$?

4) Adding a constant to the diagonal affects all eigenvalues. Would it be better to attack only the singular or near-singular values? Is this equivalent to applying PCA to X and retaining the top-$N$ principal components before regression, or does it have a different name (since it doesn't modify the cross-covariance calculation)?

5) Can we regularize the cross-covariance, and would that be of any use? That is, $$\beta_{ridge} = (\lambda I_D + X'X)^{-1}(\gamma X'y),$$

where a small $\gamma$ lowers the cross-covariance. Obviously this scales all the $\beta$s down equally, but perhaps there is a smarter approach, such as hard/soft thresholding depending on the covariance value.

Best Answer

Good questions!

  1. Yes, this is exactly correct. You can see the ridge penalty as one possible way to deal with the multicollinearity problem that arises when many predictors are highly correlated. Introducing the ridge penalty effectively lowers these correlations (see the first code sketch after this list).

  2. I think this is partly tradition and partly the fact that the ridge regression formula, as stated in your first equation, follows from the cost function $$L=\| \mathbf y - \mathbf X \beta \|^2 + \lambda \|\beta\|^2.$$ If $\lambda=0$, the second term can be dropped, and minimizing the first term ("reconstruction error") leads to the standard OLS formula for $\beta$. Keeping the second term leads to the formula for $\beta_\mathrm{ridge}$ given above. This cost function is mathematically very convenient to deal with, which might be one of the reasons for preferring the "non-normalized" $\lambda$.

  3. One possible way to normalize $\lambda$ is to scale it by the total variance $\mathrm{tr}(\mathbf X^\top \mathbf X)$, i.e. to use $\lambda \, \mathrm{tr}(\mathbf X^\top \mathbf X)$ instead of $\lambda$ (see the second sketch after this list). This would not necessarily confine $\lambda$ to $[0,1]$, but it would make it "dimensionless" and would probably result in the optimal $\lambda$ being less than $1$ in all practical cases (NB: this is just a guess!).

  4. "Attacking only small eigenvalues" does have a separate name and is called principal components regression. The connection between PCR and ridge regression is that in PCR you effectively have a "step penalty" cutting off all the eigenvalues after a certain number, whereas ridge regression applies a "soft penalty", penalizing all eigenvalues, with smaller ones getting penalized more. This is nicely explained in The Elements of Statistical Learning by Hastie et al. (freely available online), section 3.4.1. See also my answer in Relationship between ridge regression and PCA regression.

  5. I have never seen this done, but note that you could consider a cost function of the form $$L=\| \mathbf y - \mathbf X \beta \|^2 + \lambda \|\beta-\beta_0\|^2.$$ This shrinks your $\beta$ not to zero, but to some other pre-defined value $\beta_0$. If you work out the math, you arrive at the optimal $\beta$ given by $$\beta = (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} (\mathbf X^\top \mathbf y + \lambda \beta_0),$$ which can perhaps be seen as "regularizing the cross-covariance" (see the last sketch after this list)?
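
To make point 1 above concrete, here is a small numpy sketch (synthetic data; the penalty value is arbitrary) showing that rescaling $\mathbf X^\top \mathbf X + \lambda \mathbf I$ back to a correlation matrix yields smaller off-diagonal entries than $\mathbf X^\top \mathbf X$ itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two highly correlated, standardized predictors.
n = 500
z = rng.standard_normal(n)
X = np.column_stack([z + 0.1 * rng.standard_normal(n),
                     z + 0.1 * rng.standard_normal(n)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

S = X.T @ X            # proportional to the covariance matrix
lam = 0.5 * n          # illustrative penalty

def to_corr(M):
    """Rescale a symmetric positive-definite matrix to unit diagonal."""
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d)

print(to_corr(S)[0, 1])                    # close to 1: strong collinearity
print(to_corr(S + lam * np.eye(2))[0, 1])  # noticeably smaller
```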
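
For point 3, the proposed rescaling only changes how the penalty is specified before it enters the usual formula. A minimal sketch, with the "dimensionless" value chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic standardized data, as before.
n, D = 100, 5
X = rng.standard_normal((n, D))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ rng.standard_normal(D) + rng.standard_normal(n)

lam_dimensionless = 0.01                     # a "dimensionless" penalty
lam = lam_dimensionless * np.trace(X.T @ X)  # scale by the total variance tr(X'X)

beta = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
print(lam, beta)
```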
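
The contrast in point 4 can be read off the singular values of $\mathbf X$: ridge regression multiplies the contribution of the $j$-th principal direction by the smooth factor $d_j^2/(d_j^2+\lambda)$, whereas PCR keeps it outright (factor $1$) or drops it (factor $0$). A sketch on synthetic data (the value of $\lambda$ and the number of retained components are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data with deliberately uneven column scales.
n, D = 100, 5
X = rng.standard_normal((n, D)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.05])
X = X - X.mean(axis=0)

d = np.linalg.svd(X, compute_uv=False)  # singular values, largest first
lam = 10.0                              # illustrative ridge penalty
k = 3                                   # illustrative number of PCs kept by PCR

ridge_factors = d**2 / (d**2 + lam)             # soft, smooth shrinkage
pcr_factors = (np.arange(D) < k).astype(float)  # hard 0/1 cutoff

for j in range(D):
    print(f"d_{j} = {d[j]:7.2f}   ridge: {ridge_factors[j]:.3f}   PCR: {pcr_factors[j]:.0f}")
```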
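
Finally, the closed form in point 5 is easy to sanity-check numerically by comparing it against a direct minimization of the penalized cost (a sketch on synthetic data; the shrinkage target $\beta_0$ is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

n, D = 100, 4
X = rng.standard_normal((n, D))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(n)

lam = 5.0
beta0 = np.ones(D)   # arbitrary shrinkage target

# Closed form: beta = (X'X + lam*I)^{-1} (X'y + lam*beta0)
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y + lam * beta0)

# Direct minimization of ||y - X b||^2 + lam * ||b - beta0||^2
def cost(b):
    return np.sum((y - X @ b) ** 2) + lam * np.sum((b - beta0) ** 2)

beta_numeric = minimize(cost, np.zeros(D)).x

print(np.allclose(beta_closed, beta_numeric, atol=1e-4))  # expected: True
```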