Solved – Connection between Lasso formulations

lasso

This question might be dumb, but I noticed that there are two different formulations of Lasso regression. We know that the Lasso problem is to minimize an objective consisting of the squared loss plus an $\ell_1$ penalty term, expressed as follows,
$$
\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1
$$

But I often see the Lasso estimator written as
$$
\hat{\beta}_n(\lambda) = \arg \min_{\beta} \left\{ \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 \right\}
$$

My question is, are they equivalent? Where does the $\frac{1}{2n}$ term come in? The connection between the two formulations is not obvious to me.

[Update] I guess another question I should ask is,

Why is there the second formulation? What's the advantage, theoretically or computationally, of formulating the problem that way?

Best Answer

They are indeed equivalent, since you can always rescale $\lambda$ (see also @whuber's comment). From a theoretical perspective, it is a matter of convenience, but as far as I know it is not necessary. From a computational perspective, I actually find the $1/(2n)$ quite annoying, so I usually use the first formulation when I am designing an algorithm that uses regularization.
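To spell the rescaling out: for any constant $c > 0$, $\arg\min_\beta f(\beta) = \arg\min_\beta c\,f(\beta)$, so multiplying the first objective by $1/(2n)$ does not change the minimizer:
$$
\arg \min_{\beta} \left\{ \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 \right\} = \arg \min_{\beta} \left\{ \frac{1}{2n} \|y - X \beta\|_2^2 + \frac{\lambda}{2n} \|\beta\|_1 \right\}.
$$
The two formulations therefore trace out the same solution path; only the labeling of the penalty parameter changes ($\lambda$ in the second formulation plays the role of $\lambda/(2n)$ in the first).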

A little backstory: when I first started learning about penalized methods, I got annoyed carrying the $1/(2n)$ around everywhere in my work, so I preferred to ignore it; dropping it even simplified some of my calculations. At that time my work was mainly computational. More recently I have been doing theoretical work, and I have found the $1/(2n)$ indispensable (even compared to, say, $1/n$).

More details: when you try to analyze the behaviour of the Lasso as a function of the sample size $n$, you frequently have to deal with sums of iid random variables, and in practice it is generally more convenient to analyze such sums after normalizing by $n$: think law of large numbers / central limit theorem (or, if you want to get fancy, concentration of measure and empirical process theory). If you don't have the $1/n$ term in front of the loss, you ultimately end up rescaling something at the end of the analysis anyway, so it's generally nicer to have it there to start with. The $1/2$ is convenient because it cancels out some annoying factors of $2$ in the analysis (e.g. when you take the derivative of the squared loss term).
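For instance, with the $1/(2n)$ out front, the factor of $2$ from differentiating the squared loss cancels exactly:
$$
\nabla_\beta \left[ \frac{1}{2n} \|y - X \beta\|_2^2 \right] = -\frac{1}{n} X^\top (y - X \beta) = -\frac{1}{n} \sum_{i=1}^{n} x_i (y_i - x_i^\top \beta),
$$
which is precisely the kind of $n$-normalized sum of iid terms that the limit theorems above apply to.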

Another way to think of this is that when doing theory, we are generally interested in the behaviour of solutions as $n$ increases; that is, $n$ is not some fixed quantity. In practice, when we run the Lasso on some fixed dataset, $n$ is indeed fixed from the perspective of the algorithm / computations, so having the extra normalizing factor out front isn't all that helpful.

These may seem like annoying matters of convenience, but after spending enough time manipulating these kinds of inequalities, I've learned to love the $1/(2n)$.
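Finally, a quick numerical sanity check of the equivalence claimed above. This is a minimal sketch, assuming numpy and cvxpy are installed; the data and the penalty level are made up for illustration:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]          # a sparse "true" coefficient vector
y = X @ beta_true + 0.1 * rng.standard_normal(n)
lam = 5.0                                  # lambda for the unnormalized formulation

# Formulation 1: minimize ||y - X b||^2 + lam * ||b||_1
b1 = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b1)
                       + lam * cp.norm1(b1))).solve()

# Formulation 2: minimize (1/(2n)) ||y - X b||^2 + (lam/(2n)) * ||b||_1
b2 = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b2) / (2 * n)
                       + (lam / (2 * n)) * cp.norm1(b2))).solve()

# The minimizers agree up to solver tolerance.
print(np.max(np.abs(b1.value - b2.value)))
```

Both problems have the same minimizer because the second objective is just the first multiplied by the positive constant $1/(2n)$, with the penalty parameter relabeled accordingly.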