There are many penalized approaches now, with all kinds of different penalty functions (ridge, lasso, MCP, SCAD). The question of why a penalty takes a particular form comes down to "what advantages or disadvantages does that penalty provide?"
Properties of interest might be:
1) Nearly unbiased estimators (note that all penalized estimators are biased to some degree)
2) Sparsity (note that ridge regression does not produce sparse results, i.e. it does not shrink coefficients all the way to zero)
3) Continuity (to avoid instability in model prediction)
These are just a few of the properties one might want from a penalty function.
It is a lot easier to work with a sum in derivations and theoretical work: e.g. $||\beta||_2^2=\sum |\beta_i|^2$ and $||\beta||_1 = \sum |\beta_i|$. Imagine if we had $\sqrt{\left(\sum |\beta_i|^2\right)}$ or $\left( \sum |\beta_i|\right)^2$. Taking derivatives (which is necessary to show theoretical results like consistency, asymptotic normality etc) would be a pain with penalties like that.
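For example, with the sum form the derivative with respect to a single coefficient involves only that coefficient, whereas the square-root form couples all of them:
$$\frac{\partial}{\partial \beta_i}\sum_j |\beta_j|^2 = 2\beta_i,
\qquad
\frac{\partial}{\partial \beta_i}\sqrt{\sum_j |\beta_j|^2} = \frac{\beta_i}{\sqrt{\sum_j |\beta_j|^2}}.$$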
Connection between James–Stein estimator and ridge regression
Let $\mathbf y$ be a length-$m$ vector of observations of $\boldsymbol \theta$, with ${\mathbf y} \sim N({\boldsymbol \theta}, \sigma^2 I)$. The James-Stein estimator is
$$\widehat{\boldsymbol \theta}_{JS} =
\left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y}\|^2} \right) {\mathbf y}.$$
In terms of ridge regression, we can estimate $\boldsymbol \theta$ via $\min_{\boldsymbol{\theta}} \|\mathbf{y}-\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2 ,$
where the solution is $$\widehat{\boldsymbol \theta}_{\mathrm{ridge}} = \frac{1}{1+\lambda}\mathbf y.$$
It is easy to see that the two estimators have the same form, but we need to estimate $\sigma^2$ in the James-Stein estimator, whereas we determine $\lambda$ in ridge regression, typically via cross-validation.
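To see the resemblance numerically, here is a minimal sketch (Python/NumPy, assuming $\sigma^2$ is known and using a hand-picked $\lambda$ rather than a cross-validated one):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma2 = 50, 1.0
theta = rng.normal(0.0, 2.0, size=m)                  # true means
y = theta + rng.normal(0.0, np.sqrt(sigma2), size=m)  # y ~ N(theta, sigma^2 I)

# James-Stein: data-driven shrinkage of y toward zero
theta_js = (1 - (m - 2) * sigma2 / np.sum(y**2)) * y

# Ridge: constant shrinkage 1/(1 + lambda); lambda picked by hand here
lam = 0.3
theta_ridge = y / (1 + lam)

mse = lambda est: np.mean((est - theta)**2)
print(f"MSE of y itself:    {mse(y):.3f}")
print(f"MSE of James-Stein: {mse(theta_js):.3f}")
print(f"MSE of ridge:       {mse(theta_ridge):.3f}")
```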
Connection between James–Stein estimator and random effects models
Let us discuss the mixed/random effects models in genetics first. The model is $$\mathbf {y}=\mathbf {X}\boldsymbol{\beta} + \boldsymbol{Z\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I).$$
If there are no fixed effects and $\mathbf {Z}=I$, the model becomes
$$\mathbf {y}=\boldsymbol{\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I),$$
which is exactly the setting of the James-Stein estimator, viewed from an (empirical) Bayes perspective.
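To make that concrete: under this model the posterior mean of $\boldsymbol\theta$ shrinks $\mathbf y$ toward zero by a constant factor,
$$E[\boldsymbol\theta \mid \mathbf y]
= \frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma^2}\,\mathbf y
= \left(1-\frac{\sigma^2}{\sigma_\theta^2+\sigma^2}\right)\mathbf y,$$
and the James-Stein estimator replaces the unknown factor $\sigma^2/(\sigma_\theta^2+\sigma^2)$ with the unbiased estimate $(m-2)\sigma^2/\|\mathbf y\|^2$, i.e. it is an empirical Bayes version of this posterior mean.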
Connection between random effects models and ridge regression
If we focus on the random effects models above,
$$\mathbf {y}=\mathbf {Z\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I).$$
Estimating $\boldsymbol\theta$ is equivalent to solving the problem
$$\min_{\boldsymbol{\theta}} \|\mathbf{y}-\mathbf {Z\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$$
with $\lambda=\sigma^2/\sigma_{\theta}^2$. A proof can be found in Chapter 3 of *Pattern Recognition and Machine Learning* (Bishop).
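A quick numerical check of this equivalence (a sketch with made-up dimensions and variance components):

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 100, 5
sigma2_theta, sigma2 = 2.0, 0.5          # variance components (made up)

Z = rng.normal(size=(m, p))
theta = rng.normal(0.0, np.sqrt(sigma2_theta), size=p)
y = Z @ theta + rng.normal(0.0, np.sqrt(sigma2), size=m)

# Ridge solution with lambda = sigma^2 / sigma_theta^2
lam = sigma2 / sigma2_theta
theta_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Posterior mean (BLUP) of theta under the random-effects model
cov_y = sigma2_theta * (Z @ Z.T) + sigma2 * np.eye(m)
theta_blup = sigma2_theta * (Z.T @ np.linalg.solve(cov_y, y))

print(np.allclose(theta_ridge, theta_blup))   # True: the two coincide
```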
Connection between (multilevel) random effects models and that in genetics
In the random effects model above, the dimension of $\mathbf y$ is $m\times 1$ and that of $\mathbf Z$ is $m \times p$. If we vectorize $\mathbf Z$ into an $(mp)\times 1$ vector and repeat $\mathbf y$ correspondingly, we obtain a hierarchical/clustered structure with $p$ clusters, each containing $m$ units. If we then regress $\mathrm{vec}(\mathbf Z)$ on the repeated $\mathbf y$, we can obtain the random effect of $Z$ on $y$ for each cluster, although this is a kind of reverse regression.
Acknowledgement: the first three points were largely learned from these two Chinese-language articles, 1, 2.
Best Answer
From the context, I'm assuming that the $\hat{\beta}_j$'s are the ordinary least squares estimates, and the table shows how they would be transformed under each of the listed methods.
Best Subset:
Because the columns are orthonormal, the least squares coefficients are simply $\hat{\beta}_j = x_j^{T}y$. (Orthogonality implies that they're given by $\hat{\beta}_j = \frac{x_j^{T}y}{x_j^{T}x_j}$, but since the columns are orthonormal, $x_j^{T}x_j = 1$.)
Then by the definition of best subset, we're looking for the $M$ predictors that give the smallest residual sum of squares. This is equivalent to finding the $M$ largest (in absolute value) coefficients. This might already be intuitive, but if not, note that the residual sum of squares from regressing $y$ on $x_j$ is given by:
$r_j = (y - x_j\hat{\beta}_j)^T(y - x_j\hat{\beta}_j)$
$= y^Ty - 2\hat{\beta}_j x_j^Ty + \hat{\beta}_j^2$
$= y^Ty - 2(x_j^Ty)^2 + (x_j^Ty)^2$ (substituting $\hat{\beta}_j = x_j^{T}y$)
$= y^Ty - (x_j^Ty)^2$
$= y^Ty - |\hat{\beta}_j|^2,$
which is clearly minimized by making $|\hat{\beta}_j|$ as large as possible.
It follows then that the solution for best subset with $M$ predictors is to regress $y$ on each $x_j$, order the coefficients by size in absolute value, and then choose the $M$ largest of them, which is what is given by the solution in the table.
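A small numerical illustration of this (a sketch with a hypothetical orthonormal design built via QR; the ranking rule is verified against brute-force subset enumeration):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p, M = 60, 6, 3

# Orthonormal design: QR of a random matrix gives orthonormal columns
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=n)

beta_ls = X.T @ y                                       # least squares, since X^T X = I
keep = {int(j) for j in np.argsort(-np.abs(beta_ls))[:M]}  # M largest |beta_j|

# Brute-force best subset of size M: minimize the residual sum of squares
def rss(subset):
    Xs = X[:, list(subset)]
    b = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return np.sum((y - Xs @ b) ** 2)

best = min(combinations(range(p), M), key=rss)
print(set(best) == keep)    # True: same subset
```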
Lasso:
The lasso coefficient for regressing $y$ on $x_j$ is the $\hat{\beta}$ that minimizes $\frac{1}{2}(y - x_j\hat{\beta})^T(y - x_j\hat{\beta}) + \lambda|\hat{\beta}|$. Now assume that $\hat{\beta} \neq 0$. Taking the derivative of that expression with respect to $\hat{\beta}$ and setting it equal to $0$ gives
$-x_j^T(y - x_j\hat{\beta}) + \operatorname{sign}(\hat{\beta})\lambda = 0$, where we need the sign operator because the derivative of $|\hat{\beta}|$ is $1$ if $\hat{\beta} > 0$ and $-1$ otherwise.
Simplifying the expression above gives
$-x_j^Ty + x_j^Tx_j\hat{\beta} + \operatorname{sign}(\hat{\beta})\lambda = 0$
$\implies \hat{\beta} = x_j^Ty - \operatorname{sign}(\hat{\beta})\lambda$ (where we used the fact that $x_j^Tx_j = 1$, since the columns are orthonormal)
$\implies \hat{\beta} = \hat{\beta}_j - \operatorname{sign}(\hat{\beta})\lambda$ (recall the definition of $\hat{\beta}_j$, the least squares solution).
Now we consider cases for the sign of $\hat{\beta}$:
- If $\hat{\beta} > 0$, then $\hat{\beta} = \hat{\beta}_j - \lambda$, which requires $\hat{\beta}_j > \lambda$ so that the right-hand side is indeed positive.
- If $\hat{\beta} < 0$, then $\hat{\beta} = \hat{\beta}_j + \lambda$, which requires $\hat{\beta}_j < -\lambda$.
In each of these cases the sign of $\hat{\beta}$ agrees with that of $\hat{\beta}_j$, and we required that $|\hat{\beta}_j| > \lambda$. If that fails, our initial assumption that $\hat{\beta} \neq 0$ must have been wrong, so $\hat{\beta} = 0$; this is why the solution keeps only the positive part, $\operatorname{sign}(\hat{\beta}_j)\,(|\hat{\beta}_j| - \lambda)_+$.
Therefore, you get the solution in the table.
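For completeness, the resulting soft-thresholding rule is easy to check numerically (a sketch; the per-coordinate objective in the check uses $x_j^Tx_j = 1$ and drops terms not involving $\hat{\beta}$):

```python
import numpy as np

def soft_threshold(b, lam):
    """Closed-form lasso coefficient under an orthonormal design."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

rng = np.random.default_rng(3)
n, p, lam = 60, 4, 0.8
X, _ = np.linalg.qr(rng.normal(size=(n, p)))          # orthonormal columns
y = X @ np.array([2.0, -1.0, 0.3, 0.0]) + rng.normal(0.0, 0.5, size=n)

beta_ls = X.T @ y                                     # least squares, X^T X = I
beta_lasso = soft_threshold(beta_ls, lam)

# Check coordinate j against grid minimization of the per-coordinate objective
# 0.5*b^2 - b*(x_j^T y) + lam*|b|
j = 0
grid = np.linspace(-5, 5, 200001)
obj = 0.5 * grid**2 - grid * beta_ls[j] + lam * np.abs(grid)
print(beta_lasso[j], grid[np.argmin(obj)])            # agree up to grid resolution
```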