There are many penalized approaches with all kinds of different penalty functions now (ridge, lasso, MCP, SCAD). The question of why a penalty has a particular form basically amounts to asking "what advantages/disadvantages does such a penalty provide?"
Properties of interest might be:
1) Nearly unbiased estimators (note that all penalized estimators will be biased)
2) Sparsity (note that ridge regression does not produce sparse results, i.e. it does not shrink coefficients all the way to zero; see the orthonormal-design comparison just after this list)
3) Continuity (to avoid instability in model prediction)
These are just a few of the properties one might want from a penalty function.
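To make point 2 concrete, in the special case of an orthonormal design ($X^\top X = I$) both estimators have well-known closed forms in terms of the OLS estimate (up to exactly how $\lambda$ is scaled):

$$\hat\beta_j^{\text{ridge}} = \frac{\hat\beta_j^{\text{OLS}}}{1+\lambda}, \qquad \hat\beta_j^{\text{lasso}} = \operatorname{sign}\left(\hat\beta_j^{\text{OLS}}\right)\left(|\hat\beta_j^{\text{OLS}}| - \lambda\right)_+ .$$

The ridge estimate is shrunk but never exactly zero (unless the OLS estimate is already zero), while the lasso estimate is exactly zero whenever $|\hat\beta_j^{\text{OLS}}| \le \lambda$.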
It is a lot easier to work with a sum in derivations and theoretical work: e.g. $||\beta||_2^2=\sum |\beta_i|^2$ and $||\beta||_1 = \sum |\beta_i|$. Imagine if we had $\sqrt{\sum |\beta_i|^2}$ or $\left( \sum |\beta_i|\right)^2$ instead. Taking derivatives (which is necessary to show theoretical results like consistency, asymptotic normality, etc.) would be a pain with penalties like that.
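To see the difference, compare the coordinate-wise derivatives:

$$\frac{\partial}{\partial \beta_j}\sum_i \beta_i^2 = 2\beta_j, \qquad \frac{\partial}{\partial \beta_j}\sqrt{\sum_i \beta_i^2} = \frac{\beta_j}{\sqrt{\sum_i \beta_i^2}}.$$

The first depends only on $\beta_j$; the second couples all the coordinates and is not even defined at $\beta = 0$.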
I don't use ridge regression all that much, so I'll focus on (A) and (C):
(A) While the Lasso is traditionally motivated by the p > n scenario, it is mathematically well-defined when n > p, i.e. the solution exists and is unique assuming your design matrix is sufficiently well-behaved. All the same formulas and error bounds continue to hold when n > p. All the algorithms (at least that I know of) that produce Lasso estimates should also work when n > p.
Most of the time if n > p (especially if p is small) you probably want to think carefully about whether or not the Lasso is your best option. As usual, it is problem dependent. That being said, in some situations the Lasso may be appropriate when n > p: For example, if you have 10,000 predictors and 15,000 observations, it's likely that you will still want some kind of regularization to trim down the number of predictors and kill some of the noise. The Lasso may be helpful here.
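As a rough sketch of that workflow (glmnet is my choice of implementation here, not something discussed above, and the dimensions are scaled down from the example for speed):

```r
# Lasso with n > p: everything works exactly as in the p > n case.
library(glmnet)

set.seed(1)
n <- 500; p <- 100                             # n > p
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(2, 5)) + rnorm(n)   # only 5 predictors matter

cv_fit <- cv.glmnet(X, y, alpha = 1)           # alpha = 1 is the lasso
beta_hat <- coef(cv_fit, s = "lambda.min")
sum(beta_hat != 0)                             # most coefficients are set to zero
```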
(B) Ridge regression can be used in the p > n situation to alleviate singularity issues in the design matrix. This may be useful if sparsity / feature selection is not important. Moreover, ridge regression has a very nice closed-form solution that is easily interpreted, and this can be helpful in practice. In essence, you add a positive term to the main diagonal of $X^\top X$, which improves the conditioning of the sample covariance (specifically, it bounds the smallest eigenvalue away from zero as long as enough regularization is applied).
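A minimal base-R sketch of that closed form (the ridge_fit helper and the toy dimensions are my own illustration):

```r
# Closed-form ridge solution: beta_hat = (X'X + lambda * I)^(-1) X'y
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}

# With p > n, X'X is singular, but adding lambda to the diagonal
# pushes its smallest eigenvalue up to at least lambda.
set.seed(1)
n <- 20; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

min(eigen(crossprod(X), only.values = TRUE)$values)            # essentially 0
min(eigen(crossprod(X) + diag(p), only.values = TRUE)$values)  # >= 1

beta_ridge <- ridge_fit(X, y, lambda = 1)
```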
(I'll leave it to the experts to address this one more thoughtfully.)
(C) Soft thresholding and the Lasso are closely related, but not identical. One interpretation of soft thresholding is as the special case of Lasso regression when the predictors are orthogonal, which is of course a restrictive assumption.
Another interpretation of soft thresholding is as the one-at-a-time update in coordinate descent algorithms for the Lasso. I recommend the paper "Pathwise Coordinate Optimization" by Friedman et al for an introduction to these concepts. For a slightly more recent and more general treatment, there is the excellent paper "SparseNet: Coordinate Descent With Nonconvex Penalties" by Mazumder et al.
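For intuition, here is a bare-bones base-R sketch of that one-at-a-time update (a toy version, not the full pathwise algorithm from those papers), assuming the columns of X have been standardized so that crossprod(X[, j]) / n = 1 and y has been centered:

```r
# Soft-thresholding operator: S(z, lambda) = sign(z) * max(|z| - lambda, 0)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Cyclic coordinate descent for (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  n <- nrow(X); p <- ncol(X)
  b <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]  # partial residual, leaving out predictor j
      z_j  <- crossprod(X[, j], r_j) / n           # univariate least-squares coefficient
      b[j] <- soft_threshold(z_j, lambda)          # the soft-thresholded lasso update
    }
  }
  b
}
```

Each coordinate update is just a univariate least-squares fit followed by soft thresholding, which is exactly the connection described in (C).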
Best Answer
Check out the rqPen and hqreg packages in R, which claim to perform quantile regression with the lasso and elastic net respectively. Maybe you know this already, but least absolute deviation regression is median regression, i.e. quantile regression at the 50th percentile. Minimizing the absolute deviation yields the median (with the potential problem of multiple solutions), just as minimizing the squared deviation yields the mean.
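As a quick base-R sanity check of that last sentence (the simulated data are just for illustration):

```r
# The minimizer of the sum of absolute deviations is the median,
# while the minimizer of the sum of squared deviations is the mean.
set.seed(1)
x <- rnorm(101)

lad <- optimize(function(m) sum(abs(x - m)), interval = range(x))$minimum
ls  <- optimize(function(m) sum((x - m)^2),  interval = range(x))$minimum

c(lad, median(x))   # essentially identical
c(ls,  mean(x))     # essentially identical
```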