Elastic Net Regression – How to Transition from Elastic Net Loss to Scikit-Learn Implementation

elastic-net, regression, regularization

I couldn't find a better title, but here's the thing…

I was studying Elastic Net regularization and I found this function:
$$
\text{Loss} = \sum_{i=0}^n \left(y_i - (w x_i + c)\right)^2 + \lambda_1 \sum_{j=0}^{m-1} \left|w_j\right| + \lambda_2 \sum_{j=0}^{m-1} w_j^2
$$

Sorry about the image, but it's much easier this way. So I found this to be the loss function. However, there is no $\lambda_1$ or $\lambda_2$ in Scikit-Learn; instead, we find alpha and l1_ratio. After studying more, I found that Scikit-Learn's alpha is actually $\lambda$ and Scikit-Learn's l1_ratio is actually $\alpha$, both from this other equation:

$$
L_\text{enet} = \frac{1}{2n}\sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2 + \alpha \sum_{j=1}^m \left|\beta_j\right| \right)
$$

So I guess the biggest question is: how do I go from the first equation to the last equation? How are these two equations connected?

I couldn't find a way to go from the first equation to the second one.

Again, sorry about the image, I know this isn't the best practice, but it was so much easier to just add them in here.

NOTE: Please consider w and beta as the same thing, the coefficients from the regression.
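For concreteness, here's roughly how I'm calling it in scikit-learn (just a sketch with made-up data; the particular values of alpha and l1_ratio are placeholders):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

# alpha seems to play the role of lambda in the second equation,
# and l1_ratio the role of alpha in that equation.
model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print(model.coef_, model.intercept_)
```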

Best Answer

First, we're going to re-express these equations using common notation.

The coefficient vector $w$ does not include a constant element for an intercept, so we have to revise $x$ to contain a constant. The relation between $w$ and $\beta$ is $\beta = [c ~~ w]$, the concatenation of the intercept $c$ from the first equation with $w$. This means that $\beta$ has $m+1$ elements, indexed $0, 1, 2, \dots, m$, and we have $x_0 = 1$ and $\beta_0 = c$.

Rewriting the first equation, we have

$$\begin{align} \text{Loss} = L_A &= \sum_{i=0}^n \left(y_i - (w x_i + c)\right)^2 + \lambda_1 \sum_{j=0}^{m-1} \left|w_j\right| + \lambda_2 \sum_{j=0}^{m-1} w_j^2 \\ &= \sum_{i=0}^n \left(y_i - x_i^T\beta \right)^2 + \lambda_1 \sum_{j=1}^m \left|\beta_j\right| + \lambda_2 \sum_{j=1}^{m} \beta_j^2 \end{align} $$

Note that we've changed the indexing so that the intercept is not included in the penalty.
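To make the re-indexing concrete, here's a small sketch of $L_A$ written this way (the helper name `loss_A` is mine, purely for illustration); the intercept $\beta_0$ is kept out of both penalty sums:

```python
import numpy as np

def loss_A(beta, X, y, lam1, lam2):
    """L_A with x augmented by a constant column, so beta = [c, w];
    the intercept beta[0] is excluded from both penalties."""
    X_aug = np.column_stack([np.ones(len(X)), X])   # x_0 = 1, so x_i^T beta = w x_i + c
    resid = y - X_aug @ beta
    penalty = lam1 * np.abs(beta[1:]).sum() + lam2 * (beta[1:] ** 2).sum()
    return (resid ** 2).sum() + penalty
```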

I think it's confusing to have $\lambda_1$, $\lambda_2$ and $\lambda$ all in the same context, so I'm going to rewrite $\lambda = \frac{1}{2n}\gamma$, which just re-scales $\lambda$ to be expressed in units that depend on $n$. So our second equation is

$$ L_\text{enet} = L_B = \frac{1}{2n}\sum_{i=1}^n (y_i - x_i^T \beta)^2 + \frac{1}{2n} \gamma \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2+ \alpha \sum_{j=1}^m \left|\beta_j\right| \right) $$

When the hyperparameters are chosen suitably, both equations have a unique global minimum. The first equation is strictly convex when at least one of $\lambda_1, \lambda_2$ is positive and the other is non-negative; the second is strictly convex when $\gamma > 0$ and $0 \le \alpha \le 1$.

We can show that this minimum has the same coefficient vector in either case when the hyperparameters $\lambda_1, \lambda_2, \gamma, \alpha$ are well-chosen. This is not a coincidence: one expression can be rewritten as the other. In this sense, the equations are equivalent because they result in the same model, i.e. the same $\beta$.

Finally, because we only care about the location of the minimum, not the value of the minimum itself, these equations can be arbitrarily re-scaled. In other words, $L_A$ is proportional to $L_B$. I'll denote the scaling with some constant $C > 0$.

$$
\begin{align}
L_A &\propto L_B \\
L_A &= C L_B \\
\sum_{i=1}^n \left(y_i - x_i^T\beta \right)^2 + \lambda_1 \sum_{j=1}^m \left|\beta_j\right| + \lambda_2 \sum_{j=1}^m \beta_j^2 &= \frac{C}{2n}\sum_{i=1}^n (y_i - x_i^T \beta)^2 + \frac{C}{2n} \gamma \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2 + \alpha \sum_{j=1}^m \left|\beta_j \right| \right) \\
2n\sum_{i=1}^n \left(y_i - x_i^T\beta \right)^2 + 2n\lambda_1 \sum_{j=1}^m \left|\beta_j\right| + 2n\lambda_2 \sum_{j=1}^m \beta_j^2 &= C \sum_{i=1}^n (y_i - x_i^T \beta)^2 + C \gamma \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2 + \alpha \sum_{j=1}^m \left|\beta_j\right| \right) \\
2n\lambda_1 \sum_{j=1}^m \left|\beta_j\right| + 2n\lambda_2 \sum_{j=1}^m \beta_j^2 &= C\gamma \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2 + C\gamma \alpha \sum_{j=1}^m \left|\beta_j\right|
\end{align}
$$
where the last line follows if we choose $C = 2n$, so that the residual sums on both sides cancel. By inspection, matching the coefficients of the two penalty terms gives
$$
\lambda_1 = \gamma \alpha, \qquad \lambda_2 = \gamma\frac{1 - \alpha}{2}.
$$
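Recalling that $\gamma = 2n\lambda$, this says $\lambda_1 = 2n\lambda\alpha$ and $\lambda_2 = n\lambda(1-\alpha)$, with scikit-learn's alpha playing the role of $\lambda$ and l1_ratio the role of $\alpha$, as you guessed in the question. Here's a quick numeric sanity check of the identity $L_A = 2n\,L_B$ under that mapping (a sketch; `loss_A` and `loss_B` are my own helpers, not library functions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 4
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

lam, a = 0.3, 0.6                      # lambda (sklearn alpha) and alpha (sklearn l1_ratio)
gamma = 2 * n * lam
lam1, lam2 = gamma * a, gamma * (1 - a) / 2

def loss_A(c, w):
    resid = y - (X @ w + c)
    return (resid ** 2).sum() + lam1 * np.abs(w).sum() + lam2 * (w ** 2).sum()

def loss_B(c, w):
    resid = y - (X @ w + c)
    return (resid ** 2).sum() / (2 * n) + lam * ((1 - a) / 2 * (w ** 2).sum()
                                                 + a * np.abs(w).sum())

# The two losses should agree up to the constant factor C = 2n at any coefficients.
for _ in range(3):
    c, w = rng.normal(), rng.normal(size=m)
    assert np.isclose(loss_A(c, w), 2 * n * loss_B(c, w))
print("L_A == 2n * L_B at the tested points")
```

Because the two objectives differ only by the positive constant $C = 2n$, any $\beta$ that minimizes one minimizes the other, which is why the two parameterizations give the same fitted model.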
