There are many penalized approaches with all kinds of different penalty functions (ridge, lasso, MCP, SCAD). The question of why one would use a penalty of a particular form is basically: what advantages and disadvantages does such a penalty provide?
Properties of interest might be:
1) Near unbiasedness (note that every penalized estimator is biased to some degree)
2) Sparsity (note that ridge regression does not produce sparse results, i.e., it does not shrink coefficients all the way to zero; see the sketch after this list)
3) Continuity (to avoid instability in model prediction)
These are just a few of the properties one might want in a penalty function.
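As a quick illustration of point 2, here is a minimal scikit-learn sketch (the data, penalty strengths, and seed are arbitrary choices for illustration): the lasso sets some coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # only 3 truly nonzero coefficients
y = X @ beta + rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge exact zeros:", np.sum(ridge.coef_ == 0))  # typically 0: shrunk, never exactly zero
print("lasso exact zeros:", np.sum(lasso.coef_ == 0))  # typically > 0: a sparse solution
```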
It is also a lot easier to work with a sum in derivations and theoretical work: e.g., $\|\beta\|_2^2=\sum_i |\beta_i|^2$ and $\|\beta\|_1 = \sum_i |\beta_i|$. Imagine instead having $\sqrt{\sum_i |\beta_i|^2}$ or $\left( \sum_i |\beta_i|\right)^2$. Taking derivatives (which is necessary for theoretical results such as consistency and asymptotic normality) would be a pain with penalties like that.
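Concretely, the sum form makes derivatives decouple coordinate-wise, while a transformed penalty couples every coordinate:
$$\frac{\partial}{\partial \beta_j}\sum_i \beta_i^2 = 2\beta_j, \qquad \frac{\partial}{\partial \beta_j}\sqrt{\sum_i \beta_i^2} = \frac{\beta_j}{\sqrt{\sum_i \beta_i^2}}.$$
The first expression involves only $\beta_j$; the second ties every coordinate together through the norm, which complicates both the optimization and the asymptotic arguments.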
Connection between James–Stein estimator and ridge regression
Let $\mathbf y$ be a length-$m$ vector of observations of $\boldsymbol \theta$, with ${\mathbf y} \sim N({\boldsymbol \theta}, \sigma^2 I)$. The James–Stein estimator is
$$\widehat{\boldsymbol \theta}_{JS} =
\left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y}\|^2} \right) {\mathbf y}.$$
In terms of ridge regression, we can estimate $\boldsymbol \theta$ via $\min_{\boldsymbol{\theta}} \|\mathbf{y}-\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2 ,$
where the solution is $$\widehat{\boldsymbol \theta}_{\mathrm{ridge}} = \frac{1}{1+\lambda}\mathbf y.$$
It is easy to see that the two estimators have the same form; however, we need to estimate $\sigma^2$ in the James–Stein estimator, whereas in ridge regression $\lambda$ is typically determined via cross-validation.
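A small numerical sketch of this correspondence (assuming $\sigma^2$ known, and back-solving $\lambda$ from the James–Stein shrinkage factor purely to exhibit the shared form):

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma2 = 50, 1.0
theta = rng.standard_normal(m)                 # unknown mean vector
y = theta + rng.normal(scale=np.sqrt(sigma2), size=m)

# James-Stein: shrinkage factor estimated from the data (sigma^2 treated as known)
c_js = 1 - (m - 2) * sigma2 / np.sum(y**2)
theta_js = c_js * y

# Ridge: same form, with shrinkage factor 1/(1 + lambda) set by lambda
lam = 1 / c_js - 1                             # the lambda reproducing the JS factor
theta_ridge = y / (1 + lam)

print(np.allclose(theta_js, theta_ridge))      # True: identical shrinkage of y toward 0
print("MSE raw y     :", np.mean((y - theta) ** 2))
print("MSE James-Stein:", np.mean((theta_js - theta) ** 2))
```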
Connection between James–Stein estimator and random effects models
Let us discuss the mixed/random effects models in genetics first. The model is $$\mathbf {y}=\mathbf {X}\boldsymbol{\beta} + \boldsymbol{Z\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I).$$
If there are no fixed effects and $\mathbf {Z}=I$, the model becomes
$$\mathbf {y}=\boldsymbol{\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I),$$
which is exactly the setting of the James–Stein estimator, once we take an empirical Bayes point of view.
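To spell out the Bayesian idea: under this model the posterior mean is a linear shrinkage of $\mathbf y$,
$$E[\boldsymbol\theta \mid \mathbf y] = \frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma^2}\,\mathbf y = \left(1-\frac{\sigma^2}{\sigma_\theta^2+\sigma^2}\right)\mathbf y,$$
and marginally $\mathbf y \sim N(\mathbf 0, (\sigma_\theta^2+\sigma^2) I)$, so $E\!\left[(m-2)\sigma^2/\|\mathbf y\|^2\right]=\sigma^2/(\sigma_\theta^2+\sigma^2)$. Replacing the unknown shrinkage factor by this unbiased estimate yields exactly $\widehat{\boldsymbol\theta}_{JS}$: James–Stein is the empirical Bayes version of the posterior mean.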
Connection between random effects models and ridge regression
If we keep only the random-effects part of the model above,
$$\mathbf {y}=\mathbf {Z\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I).$$
The estimation is equivalent to solving the problem
$$\min_{\boldsymbol{\theta}} \|\mathbf{y}-\mathbf {Z\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$$
with $\lambda=\sigma^2/\sigma_{\theta}^2$. A proof can be found in Chapter 3 of Bishop's *Pattern Recognition and Machine Learning*.
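A quick numerical check of this equivalence (a sketch with arbitrary dimensions and variances): the ridge solution with $\lambda=\sigma^2/\sigma_\theta^2$ matches the posterior mean $E[\boldsymbol\theta\mid\mathbf y]$ computed directly from the Gaussian model.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 40, 10
sigma2, sigma2_theta = 1.0, 4.0
Z = rng.standard_normal((m, p))
y = (Z @ rng.normal(scale=np.sqrt(sigma2_theta), size=p)
     + rng.normal(scale=np.sqrt(sigma2), size=m))

lam = sigma2 / sigma2_theta

# Ridge solution: argmin ||y - Z theta||^2 + lam * ||theta||^2
theta_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Posterior mean E[theta | y] under theta ~ N(0, sigma2_theta I), e ~ N(0, sigma2 I)
theta_post = sigma2_theta * Z.T @ np.linalg.solve(
    sigma2_theta * Z @ Z.T + sigma2 * np.eye(m), y)

print(np.allclose(theta_ridge, theta_post))  # True: same estimator (Woodbury identity)
```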
Connection between (multilevel) random effects models and that in genetics
In the random effects model above, $\mathbf y$ is $m\times 1$ and $\mathbf Z$ is $m \times p$. If we vectorize $\mathbf Z$ into an $(mp)\times 1$ vector and repeat $\mathbf y$ correspondingly, we obtain a hierarchical/clustered structure: $p$ clusters, each with $m$ units. Regressing $\mathrm{vec}(\mathbf Z)$ on the repeated $\mathbf y$ then yields a per-cluster random effect of $Z$ on $y$, although this is something of a reverse regression.
Acknowledgement: the first three points are largely drawn from these two Chinese-language articles, 1, 2.
Best Answer
You are right to ask this question. In general, when a proper accuracy scoring rule is used (e.g., mean squared prediction error), ridge regression will outperform the lasso. The lasso spends some of its information trying to find the "right" predictors, and in many cases it is not even great at that. The relative performance of the two depends on the distribution of the true regression coefficients: if only a small fraction of the coefficients are truly nonzero, the lasso can perform better. Personally, I use ridge almost all the time when interested in predictive accuracy.
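Here is a minimal simulation of that trade-off (the dimensions, effect sizes, and CV grids are arbitrary illustrative choices): with many small nonzero coefficients ridge tends to win on test MSE, while with a few large coefficients the lasso tends to win; exact numbers will vary with the seed.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

def compare(beta, n=200, n_test=1000, seed=0):
    """Out-of-sample MSE of cross-validated ridge vs lasso for a given true beta."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    X, X_test = rng.standard_normal((n, p)), rng.standard_normal((n_test, p))
    y = X @ beta + rng.standard_normal(n)
    y_test = X_test @ beta + rng.standard_normal(n_test)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 30)).fit(X, y)
    lasso = LassoCV(cv=5).fit(X, y)
    return (np.mean((ridge.predict(X_test) - y_test) ** 2),
            np.mean((lasso.predict(X_test) - y_test) ** 2))

p = 50
dense = np.full(p, 0.3)                            # many small, nonzero effects
sparse = np.r_[np.full(5, 2.0), np.zeros(p - 5)]   # a few large effects, rest zero
print("dense  truth (ridge MSE, lasso MSE):", compare(dense))
print("sparse truth (ridge MSE, lasso MSE):", compare(sparse))
```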