There are many penalized approaches now, with all kinds of different penalty functions (ridge, lasso, MCP, SCAD). The question of why a penalty has a particular form comes down to asking what advantages and disadvantages that penalty provides.
Properties of interest might be:
1) Near-unbiasedness (note that all penalized estimators are biased; the goal is to keep the bias small)
2) Sparsity (note that ridge regression does not produce sparse results, i.e. it does not shrink coefficients all the way to zero)
3) Continuity (to avoid instability in model prediction)
These are just a few properties one might want from a penalty function.
It is a lot easier to work with a sum in derivations and theoretical work: e.g. $||\beta||_2^2=\sum |\beta_i|^2$ and $||\beta||_1 = \sum |\beta_i|$. Imagine if we had $\sqrt{\sum |\beta_i|^2}$ or $\left( \sum |\beta_i|\right)^2$ instead. Taking derivatives (which is necessary to establish theoretical results like consistency, asymptotic normality, etc.) would be a pain with penalties like that.
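To see the point about derivatives concretely, here is a minimal sketch (my own illustration, using sympy, not from any source cited here) comparing the coordinate-wise derivative of the sum-form ridge penalty with that of the unsquared $L^2$ norm; the former separates across coordinates while the latter couples them all through the square root:

```python
import sympy as sp

# Three coefficients keep the symbolic output readable
b1, b2, b3 = sp.symbols('b1 b2 b3', real=True)

sum_sq = b1**2 + b2**2 + b3**2   # ||beta||_2^2, the ridge penalty
l2_norm = sp.sqrt(sum_sq)        # ||beta||_2, the unsquared norm

# The sum form has a clean, coordinate-wise derivative ...
print(sp.diff(sum_sq, b1))       # -> 2*b1
# ... while the norm form entangles every coordinate
print(sp.diff(l2_norm, b1))      # -> b1/sqrt(b1**2 + b2**2 + b3**2)
```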
The answer to both questions 1 and 2 is no, but care is needed in interpreting the existence theorem.
Variance of Ridge Estimator
Let $\hat{\beta^*}$ be the ridge estimate under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$.
From Hoerl & Kennard equations 4.2-4.5, the risk, (in terms of the expected $L^2$ norm of the error) is
$$
\begin{align*}
E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)& = \sigma^2 \sum_{j=1}^p \lambda_j/ \left( \lambda_j +k \right)^2 + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\
& = \gamma_1 (k) + \gamma_2(k) \\
& = R(k)
\end{align*}
$$
where, as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1$ is the total variance of $\hat{\beta^*}$ (the sum of the coordinate-wise variances), while $\gamma_2$ is the squared length of the bias.
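This decomposition is easy to check numerically. Below is a sketch (a simulated design and settings of my own, not anything from H&K) that compares $\gamma_1(k) + \gamma_2(k)$ against a Monte Carlo estimate of the risk, reading $(X^T X + k \mathbf{I}_p)^{-2}$ as the inverse multiplied by itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 200, 5, 2.0, 1.0   # arbitrary simulation settings
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)

XtX = X.T @ X
A_inv = np.linalg.inv(XtX + k * np.eye(p))   # (X'X + kI)^{-1}
eigvals = np.linalg.eigvalsh(XtX)

# Theoretical risk R(k) = gamma_1(k) + gamma_2(k)
gamma1 = sigma**2 * np.sum(eigvals / (eigvals + k)**2)
gamma2 = k**2 * beta @ A_inv @ A_inv @ beta

# Monte Carlo risk: average squared error of the ridge estimate
sq_errors = []
for _ in range(20_000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_ridge = A_inv @ X.T @ y
    sq_errors.append(np.sum((beta_ridge - beta)**2))

print(gamma1 + gamma2, np.mean(sq_errors))   # the two should agree closely
```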
Supposing $X^T X = \mathbf{I}_p$, then
$$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$
Then the derivative of the risk with respect to $k$ is
$$R^\prime (k) = 2\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3} = \frac{2\left( k \beta^T \beta - p \sigma^2 \right)}{(1+k)^3}.$$
Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, we conclude that there is some $k^*>0$ such that $R(k^*)<R(0)$.
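A quick numerical illustration (arbitrary values of my own): evaluating the closed-form $R(k)$ on a grid shows the risk dipping below the OLS risk $R(0) = p\sigma^2$, with the minimum where the simplified derivative vanishes, at $k^* = p\sigma^2 / \beta^T \beta$:

```python
import numpy as np

p, sigma2, btb = 5, 1.0, 4.0   # p, sigma^2, and beta'beta (arbitrary)

def R(k):
    """Ridge risk under an orthonormal design: (p*sigma^2 + k^2 b'b)/(1+k)^2."""
    return (p * sigma2 + k**2 * btb) / (1 + k)**2

ks = np.linspace(0.0, 10.0, 100_001)
k_star = ks[np.argmin(R(ks))]

print(R(0.0))              # OLS risk: p*sigma^2 = 5.0
print(k_star, R(k_star))   # ~1.25 and ~2.22, strictly below R(0)
print(p * sigma2 / btb)    # analytic minimizer from R'(k) = 0
```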
The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k=0$, and that as the condition number of $X^T X$ increases, $\lim_{k \rightarrow 0^+} R^\prime (k)$ approaches $- \infty$.
Comment
There appears to be a paradox here, in that if $p=1$ and $X$ is constant, then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely shows that a risk-reducing value of $k$ exists for each fixed $\beta^T \beta$. But for any fixed $k$, we can make the risk explode by making $\beta^T \beta$ large, so this argument alone does not show that the ridge estimate (with a fixed $k$) dominates the unbiased estimate.
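To make the resolution concrete, here is the closed form evaluated at a fixed $k$ as $\beta^T \beta$ grows (illustrative numbers of my own); past some point the ridge risk exceeds the constant OLS risk $p\sigma^2$:

```python
# Ridge risk under an orthonormal design, with p = 1, sigma^2 = 1, fixed k = 0.5.
# The OLS risk is p * sigma^2 = 1.0 regardless of beta.
p, sigma2, k = 1, 1.0, 0.5

for btb in [0.1, 1.0, 10.0, 100.0]:
    ridge_risk = (p * sigma2 + k**2 * btb) / (1 + k)**2
    print(btb, round(ridge_risk, 3))   # grows without bound in beta'beta
```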
Why is ridge regression usually recommended only in the case of correlated predictors?
H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly-singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default, and that the invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point, that if your design is nearly singular, and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.
But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
Best Answer
Let $\mathbf X$ be the centered $n \times p$ predictor matrix and consider its singular value decomposition $\mathbf X = \mathbf{USV}^\top$ with $\mathbf S$ being a diagonal matrix with diagonal elements $s_i$.
The fitted values of ordinary least squares (OLS) regression are given by $$\hat {\mathbf y}_\mathrm{OLS} = \mathbf X \beta_\mathrm{OLS} = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y = \mathbf U \mathbf U^\top \mathbf y.$$ The fitted values of the ridge regression are given by $$\hat {\mathbf y}_\mathrm{ridge} = \mathbf X \beta_\mathrm{ridge} = \mathbf X (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y = \mathbf U\: \mathrm{diag}\left\{\frac{s_i^2}{s_i^2+\lambda}\right\}\mathbf U^\top \mathbf y.$$ The fitted values of the PCA regression (PCR) with $k$ components are given by $$\hat {\mathbf y}_\mathrm{PCR} = \mathbf X_\mathrm{PCA} \beta_\mathrm{PCR} = \mathbf U\: \mathrm{diag}\left\{1,\ldots, 1, 0, \ldots 0\right\}\mathbf U^\top \mathbf y,$$ where there are $k$ ones followed by zeroes.
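The three expressions are easy to verify numerically. Here is a sketch on a random centered design (the names and settings below are mine, purely to check the algebra):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, k = 50, 4, 3.0, 2   # lam = ridge penalty, k = components kept by PCR
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)            # center the predictors
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# OLS: normal equations vs. U U' y
yhat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(yhat_ols, U @ U.T @ y))                                   # True

# Ridge: penalized normal equations vs. shrunken singular directions
yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(yhat_ridge, U @ np.diag(s**2 / (s**2 + lam)) @ U.T @ y))  # True

# PCR: regression on the first k principal component scores
# vs. U diag{1,...,1,0,...,0} U' y
Z = X @ Vt[:k].T
yhat_pcr = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)
w = np.zeros(p)
w[:k] = 1.0
print(np.allclose(yhat_pcr, U @ np.diag(w) @ U.T @ y))                      # True
```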
From here we can see that:
If $\lambda=0$ then $\hat {\mathbf y}_\mathrm{ridge} = \hat {\mathbf y}_\mathrm{OLS}$.
If $\lambda>0$ then the larger the singular value $s_i$, the less it will be penalized in ridge regression. Small singular values ($s_i^2 \approx \lambda$ and smaller) are penalized the most.
In contrast, in PCA regression, large singular values are kept intact, and the small ones (after a certain number $k$) are removed completely. This corresponds to $\lambda=0$ for the first $k$ components and $\lambda=\infty$ for the rest.
This means that ridge regression can be seen as a "smooth version" of PCR (a small numerical sketch follows this list).
(This intuition is useful but does not always hold; e.g. if all the $s_i$ are approximately equal, then ridge regression can only shrink all principal components of $\mathbf X$ by roughly the same factor, and so it can differ strongly from PCR.)
Ridge regression tends to perform better in practice (e.g. it tends to have higher cross-validated performance).
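Here is the promised sketch comparing the two sets of weights (hypothetical singular values of my own): ridge shrinks each direction by the smooth factor $s_i^2/(s_i^2+\lambda)$, while PCR applies hard 0/1 weights:

```python
import numpy as np

s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])       # hypothetical singular values
lam, k = 4.0, 2                                 # ridge penalty; PCs kept by PCR

ridge_w = s**2 / (s**2 + lam)                   # smooth shrinkage factors
pcr_w = (np.arange(len(s)) < k).astype(float)   # hard 0/1 truncation

for si, rw, pw in zip(s, ridge_w, pcr_w):
    print(f"s_i = {si:5.1f}   ridge weight = {rw:.3f}   PCR weight = {pw:.0f}")
```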
Answering now your question specifically: if $\lambda \to 0$, then $\hat {\mathbf y}_\mathrm{ridge} \to \hat {\mathbf y}_\mathrm{OLS}$. I don't see how it can correspond to removing the smallest $s_i$. I think this is wrong.
One good reference is The Elements of Statistical Learning, Section 3.4.1 "Ridge regression".
See also this thread: Interpretation of ridge regularization in regression and in particular the answer by @BrianBorchers.