My understanding of LASSO regression is that the regression coefficients are selected to solve the minimisation problem:
$$\min_\beta \|y - X \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq t$$
In practice this is done using a Lagrange multiplier, making the problem to solve
$$\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1$$
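To make the $\lambda \to t$ mapping concrete, here is a minimal sketch that fits the Lagrangian form at a fixed $\lambda$ and reads off the implied constraint level $t = \|\beta\|_1$. The synthetic data and the simple coordinate-descent solver are my assumptions, not part of the question:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for min_b ||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the partial residual (b_j held out)
            r_j = X[:, j] @ (y - X @ b) + col_sq[j] * b[j]
            # soft-threshold at lam/2 (the factor 2 comes from the un-halved squared loss)
            b[j] = np.sign(r_j) * max(abs(r_j) - lam / 2.0, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ np.r_[2.0, -1.0, np.zeros(8)] + rng.normal(size=100)

b = lasso_cd(X, y, lam=10.0)
t_implied = np.abs(b).sum()  # the constraint level t that this lambda implies
```

Running the same solver on a different sample (or sub-sample) with the same $\lambda$ will generally give a different `t_implied`, which is exactly the "data dependent" point below.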
What is the relationship between $\lambda$ and $t$? Wikipedia unhelpfully just states that it is "data dependent".
Why do I care? Firstly, out of intellectual curiosity. But I am also concerned about the consequences of selecting $\lambda$ by cross-validation.
Specifically, if I'm doing $n$-fold cross-validation, I fit $n$ different models to $n$ different partitions of my training data, and then compare the accuracy of each model on the held-out data for a given $\lambda$. But the same $\lambda$ implies a different constraint ($t$) for different subsets of the data (i.e., $t=f(\lambda)$ is "data dependent").
Isn't the cross-validation problem I really want to solve to find the $t$ that gives the best bias-variance trade-off?
I can get a rough idea of the size of this effect in practice by calculating $\|\beta\|_1$ for each cross-validation split and each $\lambda$, and looking at the resulting distribution. In some cases the implied constraint ($t$) can vary quite substantially across my cross-validation subsets. By "substantially" I mean the coefficient of variation of $t$ is well above zero, $\mathrm{CV}(t) \gg 0$.
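The experiment described above can be sketched as follows: fix $\lambda$, fit on each fold's training portion, and compute the spread of the implied $t$ values. The data, the 5-fold split, and the coordinate-descent solver are assumptions for illustration:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    # min_b ||y - X b||_2^2 + lam * ||b||_1 via coordinate descent
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = X[:, j] @ (y - X @ b) + col_sq[j] * b[j]
            b[j] = np.sign(r_j) * max(abs(r_j) - lam / 2.0, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 10))
y = X @ np.r_[2.0, -1.0, np.zeros(8)] + rng.normal(size=n)

lam = 10.0
idx = rng.permutation(n)
t_vals = []
for test_idx in np.array_split(idx, 5):  # 5 cross-validation folds
    train_idx = np.setdiff1d(idx, test_idx)
    b = lasso_cd(X[train_idx], y[train_idx], lam)
    t_vals.append(np.abs(b).sum())  # implied constraint t on this fold

coef_of_variation = np.std(t_vals) / np.mean(t_vals)
```

A nonzero `coef_of_variation` shows directly that one $\lambda$ corresponds to different constraint levels $t$ on different sub-samples.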
Best Answer
The lasso has no closed-form solution, but the ridge analogue illustrates the issue. This is the standard solution for ridge regression:
$$ \beta = \left( X'X + \lambda I \right) ^{-1} X'y $$
We also know that $\| \beta \|_2 = t$ when the constraint is active, so it must be true that
$$ \left\| \left( X'X + \lambda I \right) ^{-1} X'y \right\|_2 = t, $$
which is not easy to solve for $\lambda$ in closed form.
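Although there is no closed form for $\lambda$ in terms of $t$, the ridge coefficient norm decreases monotonically in $\lambda$, so a one-dimensional root-finder can recover the $\lambda$ matching any attainable $t$ numerically. A sketch (the synthetic data and the target $t$ are assumptions):

```python
import numpy as np
from scipy.optimize import brentq

def ridge_norm(lam, X, y):
    """||beta(lambda)||_2 for the closed-form ridge solution."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.linalg.norm(beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

# Aim for half the (essentially unpenalised) coefficient norm.
t_target = 0.5 * ridge_norm(1e-8, X, y)

# ridge_norm is monotone decreasing in lambda, so the bracket [1e-8, 1e8]
# contains exactly one sign change for the root-finder.
lam_star = brentq(lambda lam: ridge_norm(lam, X, y) - t_target, 1e-8, 1e8)
```

This inverts the $\lambda \mapsto t$ map on one fixed data set; it does not remove the data dependence across sub-samples, which is the crux of the question.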
Your best bet is to keep doing what you're doing: cross-validate over $\lambda$ directly, computing the implied $t$ on the same sub-sample of the data across multiple $\lambda$ values.