Solved – Confused by MATLAB’s implementation of ridge

Tags: MATLAB, ridge regression, sparse

I have two different implementations of ridge regression in MATLAB. The first, (1), is simply

$\mathbf x = (\mathbf{A}'\mathbf{A}+\mathbf{I}\lambda)^{-1}\mathbf{A}'\mathbf b$

(as seen on Wikipedia's ridge regression page), with $\mathbf{I}$ being the identity matrix of size columns($\mathbf{A}$) $\times$ columns($\mathbf{A}$).

The second, (2), simply calls MATLAB's `ridge`:
```
x = ridge(b, A, lambda)
```

My problem is that the two return different results. (1) returns the results that I want (I know this from comparing results with other people), but why does (2) not return the same results?

My matrix $\mathbf A$ is sparse: about 1% of its entries are 1's and 99% are 0's. Some columns contain almost no 1's. The biggest difference seems to be that the coefficients for those columns with very few 1's are very close to 0 in (1), but can be quite far from 0 in (2).

Does anyone have any idea why it's different and how I can modify the call in (2) to give the same results as (1)?

Best Answer

The following MATLAB program validates what cardinal said: the difference is due to the centering and scaling that `ridge` applies to the predictors.

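```
% Create A (10-by-3 matrix) and b (10-by-1 vector)
A = rand(10,3);
b = rand(10,1);
lambda = 0.01;

% Center and scale the columns of A, as ridge does internally
s = std(A,0,1);
s = repmat(s,10,1);
A = (A - repmat(mean(A),10,1)) ./ s;

% Check the result
X1 = (A'*A + eye(3)*lambda) \ (A'*b);   % closed-form ridge solution
X2 = ridge(b,A,lambda,1);               % coefficients on the standardized scale
```

X1 then equals X2, up to floating-point rounding.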
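To get coefficients on the original scale of A, `ridge` can be called with its fourth argument set to 0, in which case it returns an intercept followed by the back-transformed slopes. The sketch below is only an illustration under stated assumptions: the data and the variable names (`zCoef`, `slopes`, `manual`, `builtin`) are made up here, and the call needs the Statistics and Machine Learning Toolbox. As far as I know, `ridge` always centers and scales the predictors, so the uncentered, intercept-free formula from the question cannot be reproduced just by changing the call.

```matlab
% Illustrative data with the same shapes as in the answer (values are arbitrary)
rng(0);
A = rand(10,3);
b = rand(10,1);
lambda = 0.01;

% Standardize the columns of A (sample standard deviation, n-1 normalization)
mu = mean(A,1);
s  = std(A,0,1);
Z  = bsxfun(@rdivide, bsxfun(@minus, A, mu), s);

% Closed-form ridge coefficients on the standardized scale
zCoef = (Z'*Z + lambda*eye(3)) \ (Z'*b);

% Map the slopes back to the original scale and restore the intercept
slopes    = zCoef ./ s';
intercept = mean(b) - mu*slopes;
manual    = [intercept; slopes];

% Built-in call with scaled = 0: intercept first, then original-scale slopes
builtin = ridge(b, A, lambda, 0);

% Largest absolute discrepancy between the two; expected to be at rounding level
maxDiff = max(abs(manual - builtin));
```

If `maxDiff` is not negligible on your data, the likely cause is a different standardization convention (for example dividing by n instead of n-1).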

Variance of Ridge Estimator

Let $\hat{\beta^*}$ be the ridge estimate under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$.
From Hoerl & Kennard equations 4.2-4.5, the risk (in terms of the expected squared $L^2$ norm of the error) is

$$ \begin{aligned} E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)& = \sigma^2 \sum_{j=1}^p \lambda_j/ \left( \lambda_j +k \right)^2 + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\ & = \gamma_1 (k) + \gamma_2(k) \\ & = R(k), \end{aligned} $$

where, as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1$ can be interpreted as the total variance (the sum of the variances) of the components of $\hat{\beta^*}$, while $\gamma_2$ is the squared length of the bias.
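For readers who want to see where the two terms come from, here is a short sketch of the derivation. The shorthand $W = \left( X^T X + k \mathbf{I}_p \right)^{-1}$ is introduced here for convenience and is not H&K's notation; the sketch assumes $E\epsilon = 0$ and $\operatorname{Cov}(\epsilon) = \sigma^2 \mathbf{I}$.

$$ \begin{aligned} \hat{\beta^*} &= W X^T Y, \qquad W X^T X = \mathbf{I}_p - kW, \\ E\hat{\beta^*} - \beta &= W X^T X \beta - \beta = -kW\beta, \\ \left\| E\hat{\beta^*} - \beta \right\|^2 &= k^2 \beta^T W^2 \beta = \gamma_2(k), \\ \operatorname{tr} \operatorname{Cov}\left(\hat{\beta^*}\right) &= \sigma^2 \operatorname{tr}\left( W X^T X W \right) = \sigma^2 \sum_{j=1}^p \frac{\lambda_j}{(\lambda_j + k)^2} = \gamma_1(k). \end{aligned} $$

The trace step uses the fact that $W$ and $X^T X$ share eigenvectors, so $W X^T X W$ has eigenvalues $\lambda_j/(\lambda_j+k)^2$. The risk is then the usual sum of total variance and squared bias.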

Supposing $X^T X = \mathbf{I}_p$, then $$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$ Let $$R^\prime (k) = 2\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3}$$ be the derivative of the risk with respect to $k$. Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, we conclude that there is some $k^*>0$ such that $R(k^*)<R(0)$.
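In this orthonormal case the minimizer can be written down explicitly. The following worked step is a sketch that assumes $\beta \neq 0$; the symbol $k^*$ is used here for the minimizer specifically.

$$ \begin{aligned} R^\prime(k) &= \frac{2\left( k \beta^T \beta - p\sigma^2 \right)}{(1+k)^3}, \\ R^\prime(k^*) = 0 \quad &\Longleftrightarrow \quad k^* = \frac{p\sigma^2}{\beta^T \beta}, \\ R(k^*) &= \frac{p\sigma^2 \, \beta^T \beta}{p\sigma^2 + \beta^T \beta} < p\sigma^2 = R(0). \end{aligned} $$

Since $R^\prime(k) < 0$ for $k < k^*$ and $R^\prime(k) > 0$ for $k > k^*$, this $k^*$ minimizes the risk over $k \geq 0$. Note that it depends on the unknown $\beta^T \beta$ and $\sigma^2$.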

The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k=0$, and that as the condition number of $X^T X$ increases, $\lim_{k \rightarrow 0^+} R^\prime (k)$ approaches $- \infty$.
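The limit can be made explicit for general eigenvalues. This is a sketch assuming $X^T X$ is invertible, so that $\beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta$ and its derivative stay bounded near $k = 0$.

$$ \begin{aligned} \gamma_1^\prime(k) &= -2\sigma^2 \sum_{j=1}^p \frac{\lambda_j}{(\lambda_j + k)^3}, \\ \gamma_2^\prime(k) &= 2k\, \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta + k^2 \frac{d}{dk}\left[ \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \right] \longrightarrow 0 \quad (k \rightarrow 0^+), \\ \lim_{k \rightarrow 0^+} R^\prime(k) &= -2\sigma^2 \sum_{j=1}^p \frac{1}{\lambda_j^2}. \end{aligned} $$

With $\lambda_j = 1$ for all $j$ this reduces to $-2p\sigma^2$, matching the orthonormal case. If the smallest eigenvalue shrinks toward $0$ while the others stay fixed, the sum is dominated by $1/\lambda_{\min}^2$ and the limit diverges to $-\infty$.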

Comment

There appears to be a paradox here, in that if $p=1$ and $X$ is constant, then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely shows that a minimizing value of $k$ exists for fixed $\beta^T \beta$. But for any $k$, we can make the risk explode by making $\beta^T \beta$ large, so this argument alone does not show admissibility for the ridge estimate.
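A small numeric illustration of this point, under assumptions chosen here for the example: take $X$ to be a column of $n$ ones, so $X^T X = n$, the unbiased estimate has risk $\sigma^2/n$, and the ridge risk is $R(k) = (n\sigma^2 + k^2\beta^2)/(n+k)^2$. Rearranging $R(k) < \sigma^2/n$ gives $\beta^2 < \sigma^2 (2/k + 1/n)$, so for a fixed $k$ ridge only wins when $\beta$ is small enough. The values of `sigma2`, `n`, `k` and `beta` below are arbitrary.

```matlab
% Hypothetical settings: noise variance, sample size, fixed ridge penalty
sigma2 = 1;
n = 20;
k = 2;

% A range of true parameter values to compare
beta = [0.1 0.5 1 2 5];

% Risk of the unbiased sample mean (does not depend on beta)
riskOls = sigma2 / n;

% Ridge risk with X'X = n: variance term plus squared-bias term
riskRidge = (n*sigma2 + k^2 * beta.^2) ./ (n + k)^2;

% Ridge beats the sample mean exactly when |beta| is below this threshold
threshold = sqrt(sigma2 * (2/k + 1/n));
ridgeWins = riskRidge < riskOls;
```

For any fixed `k`, increasing `beta` beyond `threshold` makes the ridge risk exceed `riskOls`, and the gap grows like `beta.^2`.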

Why is ridge regression usually recommended only in the case of correlated predictors?

H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and because its invariance and unbiasedness properties are attractive. When OLS fails, it fails honestly: your covariance matrix explodes. There is also perhaps a philosophical/inferential point, that if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect, and the large covariance matrix is a symptom of that.

But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.

Solved – Interpretation of ridge regularization in regression

Good questions!

1. Yes, this is exactly correct. You can see the ridge penalty as one possible way to deal with the multicollinearity problem that arises when many predictors are highly correlated. Introducing the ridge penalty effectively lowers these correlations.
2. I think this is partly tradition, partly the fact that the ridge regression formula as stated in your first equation follows from the following cost function: $$L=\| \mathbf y - \mathbf X \beta \|^2 + \lambda \|\beta\|^2.$$ If $\lambda=0$, the second term can be dropped, and minimizing the first term ("reconstruction error") leads to the standard OLS formula for $\beta$. Keeping the second term leads to the formula for $\beta_\mathrm{ridge}$. This cost function is mathematically very convenient to deal with, and this might be one of the reasons for preferring "non-normalized" lambda.
3. One possible way to normalize $\lambda$ is to scale it by the total variance $\mathrm{tr}(\mathbf X^\top \mathbf X)$, i.e. to use $\lambda \mathrm{tr}(\mathbf X^\top \mathbf X)$ instead of $\lambda$. This would not necessarily confine $\lambda$ to $[0,1]$, but would make it "dimensionless" and would probably result in the optimal $\lambda$ being less than $1$ in all practical cases (NB: this is just a guess!).
4. "Attacking only small eigenvalues" does have a separate name and is called principal components regression. The connection between PCR and ridge regression is that in PCR you effectively have a "step penalty" cutting off all the eigenvalues after a certain number, whereas ridge regression applies a "soft penalty", penalizing all eigenvalues, with smaller ones getting penalized more. This is nicely explained in The Elements of Statistical Learning by Hastie et al. (freely available online), section 3.4.1. See also my answer in Relationship between ridge regression and PCA regression.
5. I have never seen this done, but note that you could consider a cost function of the form $$L=\| \mathbf y - \mathbf X \beta \|^2 + \lambda \|\beta-\beta_0\|^2.$$ This shrinks your $\beta$ not to zero, but to some other pre-defined value $\beta_0$. If one works out the math, one arrives at the optimal $\beta$ given by $$\beta = (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} (\mathbf X^\top \mathbf y + \lambda \beta_0),$$ which perhaps can be seen as "regularizing cross-covariance"?
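The "step penalty" versus "soft penalty" contrast and the shrink-toward-$\beta_0$ formula can both be checked numerically. The MATLAB sketch below is only an illustration: the data, the choice `m = 3` of retained components, the target `beta0`, and all variable names are assumptions made here. It writes ridge and PCR as filters on the singular values of $\mathbf X$ (ridge multiplies component $j$ by $d_j^2/(d_j^2+\lambda)$, PCR by 0 or 1) and compares the $\beta_0$ formula with an equivalent augmented least-squares problem.

```matlab
% Illustrative data (sizes and coefficients are arbitrary)
rng(1);
n = 50;
p = 4;
X = randn(n, p);
X(:, 4) = X(:, 3) + 0.01 * randn(n, 1);   % make two predictors nearly collinear
y = X * [1; -1; 0.5; 0.5] + 0.1 * randn(n, 1);
lambda = 1;

% Thin SVD X = U*S*V'; the eigenvalues of X'X are the squared singular values
[U, S, V] = svd(X, 'econ');
d = diag(S);                               % sorted in decreasing order

% Ridge: soft filter applied to every component
fRidge = d.^2 ./ (d.^2 + lambda);
betaRidge = V * (fRidge .* (U' * y) ./ d);

% PCR keeping the m largest components: hard 0/1 filter
m = 3;
fPcr = double((1:p)' <= m);
betaPcr = V * (fPcr .* (U' * y) ./ d);

% Closed-form ridge solution, for comparison with the SVD version
betaClosed = (X' * X + lambda * eye(p)) \ (X' * y);

% Shrinkage toward a nonzero target beta0, using the formula from the last item
beta0 = ones(p, 1);
betaTarget = (X' * X + lambda * eye(p)) \ (X' * y + lambda * beta0);

% Same problem as least squares on data augmented with p pseudo-observations
betaAug = [X; sqrt(lambda) * eye(p)] \ [y; sqrt(lambda) * beta0];
```

Comparing `fRidge` with `fPcr` shows the difference directly: the ridge factors decay smoothly toward 0 for small singular values, while the PCR factors jump from 1 to 0 after component `m`.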

Related Solutions

Solved – Under exactly what conditions is ridge regression able to provide an improvement over ordinary least squares regression