Let $\mathbf X$ be the centered $n \times p$ predictor matrix and consider its singular value decomposition $\mathbf X = \mathbf{USV}^\top$ with $\mathbf S$ being a diagonal matrix with diagonal elements $s_i$.
The fitted values of ordinary least squares (OLS) regression are given by $$\hat {\mathbf y}_\mathrm{OLS} = \mathbf X \hat\beta_\mathrm{OLS} = \mathbf X (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf y = \mathbf U \mathbf U^\top \mathbf y.$$ The fitted values of ridge regression are given by $$\hat {\mathbf y}_\mathrm{ridge} = \mathbf X \hat\beta_\mathrm{ridge} = \mathbf X (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} \mathbf X^\top \mathbf y = \mathbf U\: \mathrm{diag}\left\{\frac{s_i^2}{s_i^2+\lambda}\right\}\mathbf U^\top \mathbf y.$$ The fitted values of principal component regression (PCR) with $k$ components are given by $$\hat {\mathbf y}_\mathrm{PCR} = \mathbf X_\mathrm{PCA} \hat\beta_\mathrm{PCR} = \mathbf U\: \mathrm{diag}\left\{1,\ldots, 1, 0, \ldots, 0\right\}\mathbf U^\top \mathbf y,$$ where there are $k$ ones followed by zeros.
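As a sanity check, here is a minimal numpy sketch (random centered data, with $\lambda = 2$ chosen arbitrarily) verifying that the normal-equations form of the ridge fit matches the SVD filter-factor form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0

# Random centered predictors and response
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)

# Ridge fit via the penalized normal equations
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
yhat_normal = X @ beta_ridge

# Ridge fit via the SVD filter factors s_i^2 / (s_i^2 + lambda)
U, s, _ = np.linalg.svd(X, full_matrices=False)
yhat_svd = U @ np.diag(s**2 / (s**2 + lam)) @ U.T @ y

print(np.allclose(yhat_normal, yhat_svd))  # True
```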
From here we can see that:
If $\lambda=0$ then $\hat {\mathbf y}_\mathrm{ridge} = \hat {\mathbf y}_\mathrm{OLS}$.
If $\lambda>0$, then the larger the singular value $s_i$, the less the corresponding component is shrunk by ridge regression. Components with small singular values ($s_i^2 \approx \lambda$ and smaller) are shrunk the most.
In contrast, in PCR, large singular values are kept intact, and the small ones (after a certain number $k$) are removed completely. This corresponds to $\lambda=0$ for the first $k$ components and $\lambda=\infty$ for the rest.
This means that ridge regression can be seen as a "smooth version" of PCR (the sketch below makes this concrete).
(This intuition is useful but does not always hold: e.g. if all $s_i$ are approximately equal, then ridge regression can only shrink all principal components of $\mathbf X$ by approximately the same factor, and so can behave very differently from PCR.)
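To make the "smooth version" intuition concrete, here is a toy sketch (the singular values, $\lambda$, and $k$ below are made up purely for illustration) printing the two sets of filter factors side by side:

```python
import numpy as np

# Hypothetical singular values and tuning choices, for illustration only
s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
lam, k = 4.0, 2

ridge_factors = s**2 / (s**2 + lam)           # smooth shrinkage
pcr_factors = (np.arange(len(s)) < k) * 1.0   # hard 0/1 cutoff

for s_i, r, c in zip(s, ridge_factors, pcr_factors):
    print(f"s_i = {s_i:4.1f}   ridge factor: {r:.3f}   PCR factor: {c:.0f}")
```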
Ridge regression tends to perform better in practice (e.g. to have higher cross-validated performance).
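As a rough illustration of that claim, here is a sklearn sketch on synthetic data (the penalty strength and number of components are arbitrary here; in practice both should be tuned by cross-validation):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Correlated predictors (low effective rank) so that shrinkage matters
X, y = make_regression(n_samples=200, n_features=30,
                       effective_rank=5, noise=10.0, random_state=0)

ridge = Ridge(alpha=10.0)                                     # arbitrary penalty
pcr = make_pipeline(PCA(n_components=5), LinearRegression())  # arbitrary k

print("ridge mean CV R^2:", cross_val_score(ridge, X, y, cv=5).mean())
print("PCR   mean CV R^2:", cross_val_score(pcr, X, y, cv=5).mean())
```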
To answer your question specifically: if $\lambda \to 0$, then $\hat {\mathbf y}_\mathrm{ridge} \to \hat {\mathbf y}_\mathrm{OLS}$. I don't see how this can correspond to removing the smallest $s_i$; I think that claim is wrong.
One good reference is The Elements of Statistical Learning, Section 3.4.1 "Ridge regression".
See also this thread: Interpretation of ridge regularization in regression and in particular the answer by @BrianBorchers.
In an unpenalized regression, you can often get a ridge* in parameter space, where many different values along the ridge all do as well or nearly as well on the least squares criterion.
* (at least, it's a ridge in the likelihood function -- they're actually valleys in the RSS criterion, but I'll continue to call it a ridge, as this seems to be conventional -- or even, as Alexis points out in comments, I could call it a thalweg, being the valley's counterpart of a ridge)
In the presence of a ridge in the least squares criterion in parameter space, the penalty you get with ridge regression gets rid of those ridges by pushing the criterion up as the parameters head away from the origin:
[Figure: contour plots of the criterion in parameter space, without and with the ridge penalty]
In the first plot, a large change in parameter values (along the ridge) produces a minuscule change in the RSS criterion. This can cause numerical instability: the solution is very sensitive to small changes (e.g. a tiny change in a data value, even truncation or rounding error). The parameter estimates are almost perfectly correlated, and you may get parameter estimates that are very large in magnitude.
By contrast, adding the $L_2$ penalty lifts what ridge regression minimizes when the parameters are far from 0, so small changes in conditions (such as a little rounding or truncation error) can't produce gigantic changes in the resulting estimates. The penalty term results in shrinkage toward 0 (introducing some bias). A small amount of bias can buy a substantial improvement in the variance (by eliminating that ridge).
The uncertainty of the estimates is reduced (the standard errors are inversely related to the second derivative, which is made larger by the penalty).
Correlation between parameter estimates is reduced. You will no longer get parameter estimates that are very large in magnitude when the RSS at small parameter values would be only slightly worse.
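Here is a minimal numpy sketch of that instability, using made-up nearly collinear data and an arbitrary $\lambda = 1$; it is only meant to illustrate the effect described above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Two nearly collinear predictors: this creates a ridge (valley) in the RSS surface
x1 = rng.standard_normal(n)
x2 = x1 + 1e-6 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = x1 + rng.standard_normal(n)

def fit(X, y, lam):
    # Penalized normal equations; lam = 0 gives plain OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# A tiny perturbation of y swings the huge OLS estimates wildly...
y2 = y + 1e-6 * rng.standard_normal(n)
print("OLS  :", fit(X, y, 0.0), "vs", fit(X, y2, 0.0))
# ...while the ridge estimates stay small and barely move
print("ridge:", fit(X, y, 1.0), "vs", fit(X, y2, 1.0))
```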
Best Answer
There are lots of penalized approaches with all kinds of different penalty functions now (ridge, lasso, MCP, SCAD). The question of why one would use a penalty of a particular form basically comes down to "what advantages/disadvantages does such a penalty provide?".
Properties of interest might be:
1) Nearly unbiased estimators (note that all penalized estimators will be biased to some extent)
2) Sparsity (note that ridge regression does not produce sparse results, i.e. it does not shrink coefficients all the way to zero; see the sketch below)
3) Continuity (to avoid instability in model prediction)
These are just a few of the properties one might want in a penalty function.
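To illustrate point 2, here is a quick sklearn sketch on synthetic data (the penalty strengths are arbitrary; in practice they should be tuned) counting exact zeros among the fitted coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with only 5 truly informative features out of 20
X, y = make_regression(n_samples=100, n_features=20,
                       n_informative=5, noise=5.0, random_state=0)

# Penalty strengths are arbitrary here; tune them in practice
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

print("exact zeros, ridge:", np.sum(ridge.coef_ == 0))  # typically none
print("exact zeros, lasso:", np.sum(lasso.coef_ == 0))  # typically many
```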
It is a lot easier to work with a sum in derivations and theoretical work: e.g. $||\beta||_2^2=\sum_i |\beta_i|^2$ and $||\beta||_1 = \sum_i |\beta_i|$. Imagine if we had $\sqrt{\sum_i |\beta_i|^2}$ or $\left( \sum_i |\beta_i|\right)^2$ instead. Taking derivatives (which is necessary for theoretical results like consistency, asymptotic normality, etc.) would be a pain with penalties like that.
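For instance, differentiating the sum form decouples coordinate-wise, while the square-root form couples all the coordinates: $$\frac{\partial}{\partial \beta_j}\left(\lambda \sum_i \beta_i^2\right) = 2\lambda \beta_j, \qquad \frac{\partial}{\partial \beta_j}\left(\lambda \sqrt{\textstyle\sum_i \beta_i^2}\right) = \frac{\lambda \beta_j}{\sqrt{\sum_i \beta_i^2}}.$$ The first expression depends on $\beta_j$ alone; the second involves every coordinate at once and is undefined at $\beta = \mathbf 0$.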