Solved – Ridge and LASSO given a covariance structure

lasso, ridge regression

After reading Chapter 3 of The Elements of Statistical Learning (Hastie, Tibshirani & Friedman), I wondered whether it is possible to implement the famous shrinkage methods named in the title of this question given a covariance structure, i.e., to minimize the (perhaps more general) quantity
$$(\vec{y}-X\vec{\beta})^TV^{-1}(\vec{y}-X\vec{\beta})+\lambda f(\beta),\ \ \ (1)$$

instead of the usual
$$(\vec{y}-X\vec{\beta})^T(\vec{y}-X\vec{\beta})+\lambda f(\beta).\ \ \ \ \ \ \ \ \ \ \ \ (2)$$
This was mainly motivated by the fact that in my particular application we have different variances for the components of $\vec{y}$ (and sometimes even a covariance structure that can be estimated), and I would love to include them in the regression. I did this for ridge regression: at least with my Python/C implementation, I see important differences in the paths the coefficients trace, which is also noticeable when comparing the cross-validation curves for the two cases.
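For concreteness: with $f(\beta)=\|\vec{\beta}\|_2^2$, criterion $(1)$ has the closed-form minimizer $\hat{\beta}=(X^TV^{-1}X+\lambda I)^{-1}X^TV^{-1}\vec{y}$, which is essentially what my ridge implementation computes. A minimal numpy sketch of that solve (illustrative names only, not my actual Python/C code):

```python
import numpy as np

def generalized_ridge(X, y, V, lam):
    """Minimize (y - X b)^T V^{-1} (y - X b) + lam * ||b||_2^2.

    Closed form: b = (X^T V^{-1} X + lam I)^{-1} X^T V^{-1} y.
    """
    Vinv_X = np.linalg.solve(V, X)   # V^{-1} X, without forming V^{-1} explicitly
    Vinv_y = np.linalg.solve(V, y)   # V^{-1} y
    p = X.shape[1]
    return np.linalg.solve(X.T @ Vinv_X + lam * np.eye(p), X.T @ Vinv_y)
```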

I am now preparing to implement the LASSO via Least Angle Regression, but to do so I first have to prove that all of its nice properties still hold when minimizing $(1)$ instead of $(2)$. So far I haven't seen any work that does all this, but some time ago I also read a quote that said something like "those who don't know statistics are doomed to rediscover it" (by Brad Efron, perhaps?), so I'm asking here first (given that I'm a relative newcomer to the statistics literature): has this already been done somewhere for these models? Is it implemented in R in some way? (including a solution and implementation of ridge that minimizes $(1)$ instead of $(2)$, the latter being what the lm.ridge code in R implements)?

Thanks in advance for your answers!

Best Answer

If we know the Cholesky decomposition $V^{-1} = L^TL$, say, then $$(y - X\beta)^T V^{-1} (y - X\beta) = (Ly - LX\beta)^T (Ly - LX\beta)$$ and we can use standard algorithms (with whatever penalization function one prefers) by replacing the response with the vector $Ly$ and the predictors with the matrix $LX$.
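For instance, here is a minimal Python sketch of this whitening trick, assuming scipy and scikit-learn are available (the function name whitened_lasso is mine, for illustration). One convenient choice of factor is $L = C^{-1}$, where $V = CC^T$ is the lower Cholesky factorization of $V$, since then $L^TL = C^{-T}C^{-1} = V^{-1}$:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from sklearn.linear_model import Lasso

def whitened_lasso(X, y, V, lam):
    """Fit the LASSO on the whitened data (LX, Ly), where V^{-1} = L^T L."""
    C = cholesky(V, lower=True)                  # V = C C^T, C lower triangular
    Ly = solve_triangular(C, y, lower=True)      # L y, via triangular solve
    LX = solve_triangular(C, X, lower=True)      # L X, column by column
    # scikit-learn's Lasso minimizes (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1,
    # so alpha is rescaled here to match the lambda of criterion (1) exactly.
    n = X.shape[0]
    model = Lasso(alpha=lam / (2 * n), fit_intercept=False)
    model.fit(LX, Ly)
    return model.coef_
```

The triangular solves avoid inverting $C$ explicitly, which is both cheaper and more numerically stable; the same whitened pair $(LX, Ly)$ can just as well be passed to a ridge or LARS routine.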