$\alpha=\frac{\lambda_1}{\lambda_1+\lambda_2}$ and $1-\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}$. And because $\lambda_i\ge0$, it should be clear that $\alpha\in[0,1]$. So in glmnet, $\lambda=\lambda_1+\lambda_2$, and each penalty has a coefficient that is either $\alpha(\lambda_1+\lambda_2)$ or $(1-\alpha)(\lambda_1+\lambda_2)$.
But treating $\alpha$ independently of $\lambda_1, \lambda_2$ is a convenient conceptual model, because $\alpha$ controls how much of the penalty is ridge and how much is lasso, with either extreme arising as a special case. You can make a model "more lasso" or "more ridge" by adjusting $\alpha$ alone, without having to worry about how to adjust $\lambda_i$ relative to the size of $\lambda_j, j\neq i$. That is, treated separately, $\alpha$ places the penalty on a continuum from pure ridge to pure lasso, while $\lambda$ controls the overall magnitude of the penalty. The two can be thought of as distinct model hyper-parameters, whereas the parametrization with two lambdas ties the mixing and the overall strength together.
And if both $\lambda_1$ and $\lambda_2$ are 0, that should correspond to no penalty, but the fraction $\frac{\lambda_1}{\lambda_1+\lambda_2}=\frac{0}{0}$ is unsightly and indeterminate.
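To make that bookkeeping concrete, here is a minimal sketch in Python of the conversion just described (the function names are hypothetical, and this uses the conceptual convention $\lambda = \lambda_1 + \lambda_2$ from this paragraph, not the exact glmnet scaling derived further down):

```python
# Minimal sketch of the conversion described above, using the convention
# lambda = lambda_1 + lambda_2 and alpha = lambda_1 / (lambda_1 + lambda_2).
# Function names are hypothetical, for illustration only.

def to_alpha_lambda(lambda1, lambda2):
    """Map (lambda_1, lambda_2) to (alpha, lambda)."""
    total = lambda1 + lambda2
    if total == 0:
        # No penalty at all: alpha = 0/0 is indeterminate, as noted above.
        raise ValueError("alpha is undefined when lambda_1 = lambda_2 = 0")
    return lambda1 / total, total


def to_two_lambdas(alpha, lam):
    """Map (alpha, lambda) back to (lambda_1, lambda_2)."""
    return alpha * lam, (1 - alpha) * lam


print(to_alpha_lambda(0.3, 0.7))  # (0.3, 1.0): a 30% lasso / 70% ridge mix
print(to_two_lambdas(0.3, 1.0))   # (0.3, 0.7)
```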
First, we're going to re-express these equations using common notation.
The coefficient vector $w$ does not include a constant element for an intercept, so we revise $x$ to contain a constant. Likewise, the relation between $w$ and $\beta$ is $\beta = [c ~~ w]$, the concatenation of the intercept $c$ from the first equation and $w$. This means that $\beta$ has $m+1$ elements, indexed $0, 1, 2, \dots, m$, and we have $x_0=1$ and $\beta_0 = c$.
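As a tiny numerical illustration of this bookkeeping (the numbers are arbitrary):

```python
# Illustration of the augmented design and coefficient vector described above:
# x gains a constant element x_0 = 1 and beta = [c  w], so x^T beta = w^T x + c.
import numpy as np

w = np.array([0.5, -1.2, 2.0])       # m coefficients, no intercept
c = 3.0                              # intercept from the first equation
beta = np.concatenate(([c], w))      # beta_0 = c, beta_1..beta_m = w

x = np.array([1.3, 0.7, -0.4])       # one observation's m features
x_aug = np.concatenate(([1.0], x))   # x_0 = 1

assert np.isclose(x_aug @ beta, w @ x + c)
```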
Rewriting the first equation, we have
$$
\begin{align}
\text{Loss} = L_A &= \sum_{i=1}^n \left(y_i - (w^T x_i + c)\right)^2 + \lambda_1 \sum_{j=0}^{m-1} \left|w_j\right| + \lambda_2 \sum_{j=0}^{m-1} w_j^2 \\
&= \sum_{i=1}^n \left(y_i - x_i^T\beta \right)^2 + \lambda_1 \sum_{j=1}^m \left|\beta_j\right| + \lambda_2 \sum_{j=1}^{m} \beta_j^2
\end{align}
$$
Note that we've changed the indexing so that the intercept is not included in the penalty.
I think it's confusing to have $\lambda_1$, $\lambda_2$ and $\lambda$ all in the same context, so I'm going to write $\lambda = \frac{\gamma}{2n}$, which just re-scales $\lambda$ to be expressed in units that depend on $n$. So our second equation is
$$
L_\text{enet} = L_B = \frac{1}{2n}\sum_{i=1}^n (y_i - x_i^T \beta)^2 + \frac{1}{2n} \gamma \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2+ \alpha \sum_{j=1}^m \left|\beta_j\right| \right)
$$
For suitable hyperparameters, each equation has a unique global minimum. The first equation is strictly convex whenever $\lambda_2>0$, since the ridge penalty contributes a strictly convex term; with $\lambda_2=0$ and $\lambda_1>0$ it is still convex, and its minimizer is unique when $X$ has full column rank. Likewise, the second equation is strictly convex for $\gamma>0$ and $0 \le \alpha < 1$, with the pure-lasso case $\alpha=1$ behaving like the first equation with $\lambda_2=0$.
We can show that this minimum is attained at the same coefficient vector in either case when the hyper-parameters $\lambda_1, \lambda_2, \gamma, \alpha$ are chosen to correspond. This is not a coincidence: one loss can be re-written as a positive multiple of the other. In this sense, the equations are equivalent because they produce the same model, i.e. the same $\beta$.
Finally, because we only care about the location of the minimum, not the value of the minimum itself, either equation can be re-scaled by a positive constant without moving its minimizer. In other words, it suffices that $L_A$ is proportional to $L_B$. I'll denote the scaling with some constant $C>0$.
$$
\begin{align}
L_A &\propto L_B \\
L_A &= C L_B \\
\sum_{i=1}^n \left(y_i - x_i^T\beta \right)^2 + \lambda_1 \sum_{j=1}^m \left|\beta_j\right| + \lambda_2 \sum_{j=1}^m \beta_j^2
&= \frac{C}{2n}\sum_{i=1}^n (y_i - x_i^T \beta)^2 + \frac{C}{2n} \gamma \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2+ \alpha \sum_{j=1}^m \left|\beta_j \right| \right) \\
2n\sum_{i=1}^n \left(y_i - x_i^T\beta \right)^2 + 2n\lambda_1 \sum_{j=1}^m \left|\beta_j\right| + 2n\lambda_2 \sum_{j=1}^m \beta_j^2
&= C \sum_{i=1}^n (y_i - x_i^T \beta)^2 + C \gamma \left( \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2+ \alpha \sum_{j=1}^m \left|\beta_j\right| \right) \\
2n\lambda_1 \sum_{j=1}^m \left|\beta_j\right| + 2n\lambda_2 \sum_{j=1}^m \beta_j^2 &= C\gamma \frac{1-\alpha}{2} \sum_{j=1}^m \beta_j^2+ C\gamma \alpha \sum_{j=1}^m \left|\beta_j\right|
\end{align}
$$
if we choose $C = 2n$, so that the squared-error sums on both sides cancel. By inspection, we can now write
$$
\begin{align}
\lambda_1 &= \gamma \alpha \\
\lambda_2 &= \gamma\frac{1 -\alpha}{2}
\end{align}
$$
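As a quick numerical sanity check of this mapping (a sketch with random data and an arbitrary $\beta$, ignoring the intercept for simplicity), $L_A(\beta)$ should equal $2n\,L_B(\beta)$ for every $\beta$, so the two objectives share the same minimizer:

```python
# Sanity check: with lambda_1 = gamma*alpha and lambda_2 = gamma*(1-alpha)/2,
# L_A(beta) equals 2n * L_B(beta) for any beta. Random data, arbitrary beta,
# no intercept; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5
X = rng.normal(size=(n, m))
y = rng.normal(size=n)
beta = rng.normal(size=m)

gamma, alpha = 1.7, 0.4
lambda1 = gamma * alpha
lambda2 = gamma * (1 - alpha) / 2

resid = y - X @ beta
rss = resid @ resid
L_A = rss + lambda1 * np.abs(beta).sum() + lambda2 * (beta ** 2).sum()
L_B = rss / (2 * n) + gamma / (2 * n) * (
    (1 - alpha) / 2 * (beta ** 2).sum() + alpha * np.abs(beta).sum()
)

print(np.isclose(L_A, 2 * n * L_B))  # True: the objectives differ only by C = 2n
```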
Best Answer
I emailed this question to Zou and to Hastie and got the following reply from Hastie (I hope he wouldn't mind me quoting it here):
I interpret these words as an endorsement of some form of "rescaling" of the vanilla elastic net solution, but Hastie no longer seems to stand by the particular approach put forward in Zou & Hastie (2005).
In the following I will briefly review and compare several rescaling options.
I will be using the glmnet parametrization of the loss $$\mathcal L = \frac{1}{2n}\big\lVert y - \beta_0-X\beta\big\rVert^2 + \lambda\big(\alpha\lVert \beta\rVert_1 + (1-\alpha) \lVert \beta\rVert^2_2/2\big),$$ with the solution denoted as $\hat\beta$.

The approach of Zou & Hastie is to use $$\hat\beta_\text{rescaled} = \big(1+\lambda(1-\alpha)\big)\hat\beta.$$ Note that this yields some non-trivial rescaling for pure ridge when $\alpha=0$, which arguably does not make a lot of sense. On the other hand, this yields no rescaling for pure lasso when $\alpha=1$, despite various claims in the literature that the lasso estimator could benefit from some rescaling (see below).
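As a minimal sketch of this rescaling, assuming scikit-learn's `ElasticNet` as a stand-in for glmnet (its objective should match the parametrization above with `alpha` set to $\lambda$ and `l1_ratio` set to $\alpha$; the data and hyperparameter values are purely illustrative):

```python
# Sketch of the Zou & Hastie (2005) rescaling in the parametrization above.
# scikit-learn's ElasticNet is used as a stand-in for glmnet: its objective
# should match the loss written here with alpha=lambda and l1_ratio=alpha
# (the intercept is fitted separately and is not rescaled).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lam, alpha = 0.1, 0.5
fit = ElasticNet(alpha=lam, l1_ratio=alpha).fit(X, y)

beta_hat = fit.coef_
beta_rescaled = (1 + lam * (1 - alpha)) * beta_hat  # Zou & Hastie rescaling
```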
For pure lasso, Tibshirani suggested using a lasso-OLS hybrid, i.e. using the OLS estimator on the subset of predictors selected by the lasso. This makes the estimator consistent (but undoes the shrinkage, which can increase the expected error). One can use the same approach for elastic net $$\hat\beta_\text{elastic-OLS-hybrid}= \text{OLS}(X_i\mid\hat\beta_i\ne 0),$$ but the potential problem is that elastic net can select more than $n$ predictors and OLS will break down (in contrast, pure lasso never selects more than $n$ predictors).
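A sketch of this elastic-net-OLS hybrid under the same illustrative setup (scikit-learn as a stand-in; all names and values are for illustration only):

```python
# Sketch of the elastic-net-OLS hybrid: refit OLS on the predictors with
# nonzero elastic net coefficients. Breaks down if more than n predictors
# are selected, as noted above.
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

beta_hat = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_
selected = np.flatnonzero(beta_hat)          # support of the elastic net fit

beta_hybrid = np.zeros_like(beta_hat)
if 0 < selected.size <= X.shape[0]:          # OLS needs at most n predictors
    ols = LinearRegression().fit(X[:, selected], y)
    beta_hybrid[selected] = ols.coef_
```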
The relaxed lasso mentioned in Hastie's email quoted above is a suggestion to run another lasso on the subset of predictors selected by the first lasso. The idea is to use two different penalties and to select both via cross-validation. One could apply the same idea to elastic net, but this would seem to require four different regularization parameters, and tuning them is a nightmare.
I suggest a simpler relaxed elastic net scheme: after obtaining $\hat\beta$, perform ridge regression with $\alpha=0$ and the same $\lambda$ on the selected subset of predictors: $$\hat\beta_\text{relaxed-elastic-net}= \text{Ridge}(X_i\mid\hat\beta_i\ne 0).$$ This (a) does not require any additional regularization parameters, (b) works for any number of selected predictors, and (c) does not do anything if one starts with pure ridge. Sounds good to me.
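A sketch of this relaxed elastic net scheme, again with scikit-learn as a stand-in; note that `Ridge` minimizes $\lVert y - Xw\rVert^2 + a\lVert w\rVert^2$, so matching "the same $\lambda$" in the parametrization above requires $a = n\lambda$ (this scaling is my assumption and worth double-checking against one's own parametrization):

```python
# Sketch of the "relaxed elastic net": ridge with the same lambda on the
# support of the elastic net fit. Ridge minimizes ||y - Xw||^2 + a*||w||^2,
# so a = n * lam is assumed to correspond to the glmnet-style ridge penalty
# (1/(2n))*||y - Xw||^2 + lam*||w||^2/2. Data and values are illustrative.
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lam = 0.1
beta_hat = ElasticNet(alpha=lam, l1_ratio=0.5).fit(X, y).coef_
selected = np.flatnonzero(beta_hat)

beta_relaxed = np.zeros_like(beta_hat)
if selected.size > 0:
    ridge = Ridge(alpha=len(y) * lam).fit(X[:, selected], y)
    beta_relaxed[selected] = ridge.coef_
```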
I am currently working with a small $n\ll p$ dataset with $n=44$ and $p=3000$, where $y$ is well predicted by the few leading PCs of $X$. I will compare the performance of the above estimators using 100x repeated 11-fold cross-validation. As a performance metric, I am using test error, normalized to yield something like an R-squared: $$R^2_\text{test} = 1-\frac{\lVert y_\text{test} - \hat\beta_0 - X_\text{test}\hat\beta\rVert^2}{\lVert y_\text{test} - \hat\beta_0\rVert^2}.$$ In the figure below, dashed lines correspond to the vanilla elastic net estimator $\hat\beta$, and the three subplots correspond to the three rescaling approaches.
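(For reference, the metric above can be computed with a small helper like the following; variable names are hypothetical, and $\hat\beta_0$, $\hat\beta$ come from the training folds.)

```python
# Helper for the test metric defined above. beta0 and beta are estimated on
# the training folds; names are hypothetical.
import numpy as np

def r2_test(y_test, X_test, beta0, beta):
    resid = y_test - beta0 - X_test @ beta
    return 1.0 - (resid @ resid) / np.sum((y_test - beta0) ** 2)
```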
So, at least in these data, all three approaches outperform the vanilla elastic net estimator, and "relaxed elastic net" performs the best.