This problem is typically solved by fitting with coordinate descent. This method is both safer and more efficient numerically, algorithmically easier to implement, and applicable to a broader class of models (including Cox regression). An R implementation is available in the R package glmnet. The code is open source (partly in Fortran and C, partly in R), so you can use it as a blueprint.
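For concreteness, here is a minimal sketch of a lasso fit with glmnet; the simulated data and variable names are illustrative only, not from the original question:

```r
# Minimal sketch: lasso fit via coordinate descent using glmnet.
# The simulated data and variable names are illustrative only.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)

fit   <- glmnet(X, y, alpha = 1)       # alpha = 1 selects the lasso penalty
cvfit <- cv.glmnet(X, y, alpha = 1)    # cross-validation to choose lambda
coef(cvfit, s = "lambda.min")          # coefficients at the selected lambda
```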
I don't use ridge regression all that much, so I'll focus on (A) and (C):
(A) While the Lasso is traditionally motivated by the p > n scenario, it is mathematically well-defined when n > p, i.e., the solution exists and is unique provided the design matrix is sufficiently well-behaved (for example, when it has full column rank). All the same formulas and error bounds continue to hold when n > p, and all the algorithms (at least that I know of) that produce Lasso estimates should also work when n > p.
Most of the time if n > p (especially if p is small) you probably want to think carefully about whether or not the Lasso is your best option. As usual, it is problem dependent. That being said, in some situations the Lasso may be appropriate when n > p: For example, if you have 10,000 predictors and 15,000 observations, it's likely that you will still want some kind of regularization to trim down the number of predictors and kill some of the noise. The Lasso may be helpful here.
(B) Ridge regression can be used in the p > n situation to alleviate singularity issues in the design matrix. It may be useful when sparsity / feature selection is not important. Moreover, ridge regression has a very nice closed-form solution that is easily interpreted, and this can be helpful in practice (a small sketch is given below). In essence, you add a positive constant to the main diagonal, which improves the conditioning of the sample covariance: its eigenvalues are bounded away from zero as soon as any positive amount of regularization is applied.
(I'll leave it to the experts to address this one more thoughtfully.)
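To illustrate the closed form mentioned above, here is a rough sketch of the standard ridge estimate, beta = (X'X + lambda*I)^{-1} X'y; the function name is mine:

```r
# Illustrative sketch of the ridge closed form: (X'X + lambda * I)^{-1} X'y.
# Adding lambda > 0 to the diagonal keeps X'X + lambda * I invertible even when p > n.
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}
```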
(C) Soft thresholding and the Lasso are closely related, but not identical. One interpretation of soft thresholding is as the special case of Lasso regression when the predictors are orthogonal, which is of course a restrictive assumption.
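To make the connection concrete, here is a small sketch (function names are mine) of the soft-thresholding operator S_gamma(z) = sign(z) * max(|z| - gamma, 0) and its use on X'y when X'X = I, which is exactly the closed-form lasso solution derived in the answer further down this page:

```r
# Soft-thresholding operator: S_gamma(z) = sgn(z) * max(|z| - gamma, 0).
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# When X has orthonormal columns (X'X = I), the lasso solution is obtained
# by soft-thresholding the least-squares solution X'y coordinate-wise.
lasso_orthonormal <- function(X, y, gamma) drop(soft_threshold(crossprod(X, y), gamma))
```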
Another interpretation of soft thresholding is as the one-at-a-time update in coordinate descent algorithms for the Lasso. I recommend the paper "Pathwise Coordinate Optimization" by Friedman et al. for an introduction to these concepts. For a slightly more recent and more general treatment, there is the excellent paper "SparseNet: Coordinate Descent With Nonconvex Penalties" by Mazumder et al.
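As a rough sketch of the idea (not glmnet's actual implementation; the names and the plain least-squares scaling of the objective are my assumptions), each sweep of cyclic coordinate descent updates one coefficient at a time by soft-thresholding a partial residual:

```r
# Rough sketch of cyclic coordinate descent for the lasso objective
#   (1/2) * ||y - X beta||_2^2 + gamma * ||beta||_1.
# Each coordinate update is a single soft-thresholding step on a partial residual.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

lasso_cd <- function(X, y, gamma, n_sweeps = 100) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (sweep in seq_len(n_sweeps)) {
    for (j in seq_len(p)) {
      # Residual with the j-th predictor's contribution removed.
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # One-dimensional lasso problem in beta_j, solved by soft thresholding.
      beta[j] <- soft_threshold(sum(X[, j] * r_j), gamma) / sum(X[, j]^2)
    }
  }
  beta
}
```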
Best Answer
This can be attacked in a number of ways, including fairly economical approaches via the Karush–Kuhn–Tucker conditions.
Below is a quite elementary alternative argument.
The least squares solution for an orthogonal design
Suppose $X$ has orthonormal columns, so that $X^T X = I$. Then, the least-squares solution is $$ \newcommand{\bls}{\hat{\beta}^{{\small \text{LS}}}}\newcommand{\blasso}{\hat{\beta}^{{\text{lasso}}}} \bls = (X^T X)^{-1} X^T y = X^T y \>. $$
Some equivalent problems
Via the Lagrangian form, it is straightforward to see that an equivalent problem to that considered in the question is $$ \min_\beta \frac{1}{2} \|y - X \beta\|_2^2 + \gamma \|\beta\|_1 \>. $$
Expanding out the first term gives $\frac{1}{2}\|y - X\beta\|_2^2 = \frac{1}{2} y^T y - y^T X \beta + \frac{1}{2}\beta^T X^T X \beta$, and since $X^T X = I$, the last term equals $\frac{1}{2}\|\beta\|_2^2$. The term $\frac{1}{2} y^T y$ does not contain any of the variables of interest, so we can discard it and consider yet another equivalent problem, $$ \min_\beta \left(- y^T X \beta + \frac{1}{2} \|\beta\|_2^2\right) + \gamma \|\beta\|_1 \>. $$
Noting that $\bls = X^T y$, the previous problem can be rewritten as $$ \min_\beta \sum_{i=1}^p - \bls_i \beta_i + \frac{1}{2} \beta_i^2 + \gamma |\beta_i| \> . $$
Our objective function is now a sum of objectives, each corresponding to a separate variable $\beta_i$, so they may each be solved individually.
The whole is equal to the sum of its parts
Fix a certain $i$. Then, we want to minimize $$ \mathcal L_i = -\bls_i \beta_i + \frac{1}{2}\beta_i^2 + \gamma |\beta_i| \> . $$
If $\bls_i > 0$, then we must have $\beta_i \geq 0$ since otherwise we could flip its sign and get a lower value for the objective function. Likewise if $\bls_i < 0$, then we must choose $\beta_i \leq 0$.
Case 1: $\bls_i > 0$. Since $\beta_i \geq 0$, $$ \mathcal L_i = -\bls_i \beta_i + \frac{1}{2}\beta_i^2 + \gamma \beta_i \> , $$ and differentiating this with respect to $\beta_i$ and setting equal to zero, we get $\beta_i = \bls_i - \gamma$ and this is only feasible if the right-hand side is nonnegative, so in this case the actual solution is $$ \blasso_i = (\bls_i - \gamma)^+ = \mathrm{sgn}(\bls_i)(|\bls_i| - \gamma)^+ \>. $$
Case 2: $\bls_i \leq 0$. This implies we must have $\beta_i \leq 0$ and so $$ \mathcal L_i = -\bls_i \beta_i + \frac{1}{2}\beta_i^2 - \gamma \beta_i \> . $$ Differentiating with respect to $\beta_i$ and setting equal to zero, we get $\beta_i = \bls_i + \gamma = \mathrm{sgn}(\bls_i)(|\bls_i| - \gamma)$. But, again, to ensure this is feasible, we need $\beta_i \leq 0$, which is achieved by taking $$ \blasso_i = \mathrm{sgn}(\bls_i)(|\bls_i| - \gamma)^+ \>. $$
In both cases, we get the desired form, and so we are done.
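As a quick numerical sanity check (an illustrative sketch; the values of $\bls_i$ and $\gamma$ are arbitrary), minimizing $\mathcal L_i$ over a fine grid agrees with the closed form $\mathrm{sgn}(\bls_i)(|\bls_i| - \gamma)^+$:

```r
# Sanity check of the per-coordinate closed form: minimize
#   L_i(b) = -bls_i * b + b^2 / 2 + gamma * |b|
# over a fine grid and compare with sgn(bls_i) * max(|bls_i| - gamma, 0).
L_i <- function(b, bls_i, gamma) -bls_i * b + b^2 / 2 + gamma * abs(b)

bls_i <- 1.7; gamma <- 0.6                       # arbitrary illustrative values
grid  <- seq(-5, 5, by = 1e-4)
grid_min    <- grid[which.min(L_i(grid, bls_i, gamma))]
closed_form <- sign(bls_i) * max(abs(bls_i) - gamma, 0)
c(grid = grid_min, closed_form = closed_form)    # both approximately 1.1
```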
Final remarks
Note that as $\gamma$ increases, then each of the $|\blasso_i|$ necessarily decreases, hence so does $\|\blasso\|_1$. When $\gamma = 0$, we recover the OLS solutions, and, for $\gamma > \max_i |\bls_i|$, we obtain $\blasso_i = 0$ for all $i$.