I don't use ridge regression all that much, so I'll focus on (A) and (C):

(A) While the Lasso is traditionally motivated by the p > n scenario, it is mathematically well-defined when n > p, i.e. the solution exists and is unique assuming your design matrix is sufficiently well-behaved. All the same formulas and error bounds continue to hold when n > p. All the algorithms (at least that I know of) that produce Lasso estimates should also work when n > p.

Most of the time if n > p (especially if p is small) you probably want to think carefully about whether or not the Lasso is your best option. As usual, it is problem dependent. That being said, in some situations the Lasso may be appropriate when n > p: For example, if you have 10,000 predictors and 15,000 observations, it's likely that you will still want some kind of regularization to trim down the number of predictors and kill some of the noise. The Lasso may be helpful here.

(B) Ridge regression can be used in the p > n situation to alleviate singularity issues in the design matrix. This may be useful if sparsity / feature selection is not important. Moreover, ridge regression has a very nice closed form solution that is easily interpreted, and this can be helpful in practice. In essence, you add a positive term to the main diagonal, which improves the regularity of the sample covariance (specifically, it removes vanishing eigenvalues as long as enough regularization is applied).
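The effect of the diagonal term on the eigenvalues can be checked numerically. The following is only a minimal sketch: the sizes `n = 5`, `p = 8` and the penalty `lambda = 0.5` are illustrative assumptions, not values from the discussion above.

```r
# Illustrative p > n setting (all sizes and the penalty are assumptions)
set.seed(1)
n <- 5; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
lambda <- 0.5

G <- crossprod(X)                       # X'X has rank at most n < p, so it is singular
ev_plain <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
ev_ridge <- eigen(G + lambda * diag(p), symmetric = TRUE, only.values = TRUE)$values

# Closed-form ridge estimate: (X'X + lambda I)^{-1} X'y
beta_ridge <- solve(G + lambda * diag(p), crossprod(X, y))

min(ev_plain)                           # numerically zero: the plain system is singular
min(ev_ridge)                           # every eigenvalue is shifted up by lambda
```

Every eigenvalue of $X^TX + \lambda I$ equals the corresponding eigenvalue of $X^TX$ plus $\lambda$, which is why the linear system becomes solvable.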

(I'll leave it to the experts to address this one more thoughtfully.)

(C) Soft thresholding and the Lasso are closely related, but not identical. One interpretation of soft thresholding is as the special case of Lasso regression when the predictors are orthogonal, which is of course a restrictive assumption.
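To make the orthogonal case concrete, assume the Lasso objective is written as $\tfrac{1}{2}||y - X\beta||_2^2 + \lambda ||\beta||_1$ and that the columns of $X$ are orthonormal, so $X^TX = I$. This scaling is an assumption; other conventions simply rescale $\lambda$. The problem then separates by coordinate, and each Lasso coefficient is the soft-thresholded least-squares coefficient:

$\hat{\beta}_j^{\text{lasso}} = \operatorname{sign}(\hat{\beta}_j^{\text{OLS}}) \, (|\hat{\beta}_j^{\text{OLS}}| - \lambda)_+, \qquad \hat{\beta}^{\text{OLS}} = X^Ty$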

Another interpretation of soft thresholding is as the one-at-a-time update in coordinate descent algorithms for the Lasso. I recommend the paper "Pathwise Coordinate Optimization" by Friedman et al. for an introduction to these concepts. For a slightly more recent and more general treatment, there is the excellent paper "SparseNet: Coordinate Descent With Nonconvex Penalties" by Mazumder et al.
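The coordinate-wise update can be sketched in a few lines of R. This is a bare-bones illustration, not the algorithm as implemented in any package: the function names `soft_threshold` and `lasso_cd`, the objective scaling $\tfrac{1}{2n}||y - X\beta||_2^2 + \lambda||\beta||_1$, the fixed iteration count and the simulated data are all assumptions made for the example.

```r
# Soft-thresholding operator: S(z, g) = sign(z) * max(|z| - g, 0)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Cyclic coordinate descent for (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  col_ss <- colSums(X^2) / n            # (1/n) * x_j' x_j for each column
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual: remove the fit of every coordinate except j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      z_j <- sum(X[, j] * r_j) / n
      # One-at-a-time update: soft-threshold, then rescale
      beta[j] <- soft_threshold(z_j, lambda) / col_ss[j]
    }
  }
  beta
}

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)
y <- 2 * X[, 1] - X[, 2] + rnorm(n, sd = 0.2)   # third predictor is pure noise
beta_hat <- lasso_cd(X, y, lambda = 0.1)
```

A real implementation would add a convergence check and update the residual incrementally instead of recomputing it for each coordinate.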

Honestly, I do not think taking a log will always be a good idea, even if it gives you positive responses, because it puts more stress on small violations than on large ones: small violations carry relatively more weight in the loss on the log scale than on the original scale. If this is not what you want, you probably should not use it.

I think a simple idea is to just use the normal model and training. When it gives negative responses, set them to 0, and tune the hyperparameter based on these clipped predictions. I think this can give you a reasonable model.
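A minimal sketch of this idea, assuming ridge regression as the "normal model" and a single hold-out split for tuning. The simulated data, the split, the grid `lambdas`, the helper `ridge_fit` and the omission of an intercept are all illustrative assumptions.

```r
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
# Simulated non-negative response (illustrative only)
y <- pmax(X %*% c(1, 0.5, 0, 0, 0) + rnorm(n, sd = 0.5), 0)

train <- seq_len(150); valid <- 151:n             # hypothetical hold-out split

# Plain closed-form ridge fit, no intercept for brevity
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- c(0.01, 0.1, 1, 10, 100)               # hypothetical grid
valid_mse <- sapply(lambdas, function(l) {
  beta <- ridge_fit(X[train, ], y[train], l)
  pred <- pmax(X[valid, ] %*% beta, 0)            # set negative predictions to 0
  mean((y[valid] - pred)^2)                       # score the clipped predictions
})
best_lambda <- lambdas[which.min(valid_mse)]
```

The point is only that the clipping happens before the validation error is computed, so the hyperparameter is chosen for the predictions you will actually use.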

## Best Answer

The rather anticlimactic answer to "Does anyone know why this is?" is that simply nobody cares enough to implement a non-negative ridge regression routine. One of the main reasons is that people have already started implementing non-negative elastic net routines (for example here and here). Elastic net includes ridge regression as a special case (one essentially sets the LASSO part to have zero weight). These works are relatively new, so they have not yet been incorporated into scikit-learn or a similar general-use package. You might want to ask the authors of these papers for code.

EDIT: As @amoeba and I discussed in the comments, the actual implementation of this is relatively simple. Say one has the following regression problem:

$y = 2 x_1 - x_2 + \epsilon, \qquad \epsilon \sim N(0,0.2^2)$

where $x_1$ and $x_2$ are both standard normal, i.e. $x_p \sim N(0,1)$. Notice I use standardised predictor variables so I do not have to normalise afterwards. For simplicity I do not include an intercept either. We can immediately solve this regression problem using standard linear regression. So in R it should be something like this:
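A minimal sketch of that setup follows. The seed and the sample size `N = 100` are illustrative assumptions, not values from the original post. The final line solves the least-squares problem through the QR decomposition.

```r
set.seed(123)                             # illustrative seed
N  <- 100                                 # illustrative sample size
x1 <- rnorm(N); x2 <- rnorm(N)            # standard normal predictors
y  <- 2 * x1 - x2 + rnorm(N, sd = 0.2)    # y = 2 x1 - x2 + eps, eps ~ N(0, 0.2^2)
X  <- cbind(x1, x2)                       # design matrix, no intercept column

beta_lm  <- coef(lm(y ~ X - 1))           # ordinary least squares without intercept
beta_ols <- qr.solve(X, y)                # same estimate via the QR decomposition
```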

Notice the last line. Almost all linear regression routines use the QR decomposition to estimate $\beta$. We would like to use the same for our ridge regression problem. At this point read this post by @whuber; we will be implementing exactly this procedure. In short, we will augment our original design matrix $X$ with the diagonal matrix $\sqrt{\lambda}I_p$ and our response vector $y$ with $p$ zeros. In that way we can re-express the original ridge regression estimate $(X^TX + \lambda I)^{-1} X^Ty$ as $(\bar{X}^T\bar{X})^{-1} \bar{X}^T\bar{y}$, where the bar denotes the augmented version; this works because $\bar{X}^T\bar{X} = X^TX + \lambda I$ and $\bar{X}^T\bar{y} = X^Ty$. Check slides 18-19 from these notes too for completeness; I found them quite straightforward. So in R we would do the same QR solve on the augmented $\bar{X}$ and $\bar{y}$, and it works.

OK, so we got the ridge regression part. We could solve it another way though: we could formulate it as an optimisation problem where the residual sum of squares is the cost function and then optimise against it, i.e. $\displaystyle \min_{\beta} || \bar{y} - \bar{X}\beta||_2^2$. Sure enough, we can do that:
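Both routes, the QR solve on the augmented system and the direct minimisation of the residual sum of squares, can be sketched in R as follows. This is a minimal sketch rather than the original listing: the seed, the sample size, `lambda = 2` and the choice of `method = "BFGS"` are illustrative assumptions.

```r
set.seed(123)                                   # illustrative seed and sizes
N <- 100; p <- 2
X <- cbind(rnorm(N), rnorm(N))
y <- 2 * X[, 1] - X[, 2] + rnorm(N, sd = 0.2)
lambda <- 2                                     # hypothetical penalty

X_aug <- rbind(X, sqrt(lambda) * diag(p))       # append sqrt(lambda) * I_p to X
y_aug <- c(y, rep(0, p))                        # append p zeros to y

# Reference: closed-form ridge estimate (X'X + lambda I)^{-1} X'y
beta_closed <- drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))

# Route 1: QR least squares on the augmented system
beta_qr <- qr.solve(X_aug, y_aug)

# Route 2: minimise the augmented residual sum of squares numerically
rss <- function(b) sum((y_aug - X_aug %*% b)^2)
beta_optim <- optim(par = c(0, 0), fn = rss, method = "BFGS")$par
```

The QR route should match the closed form to machine precision, while the numerical optimiser should agree only up to its convergence tolerance.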

which as expected again works. So now we just want $\displaystyle \min_{\beta} || \bar{y} - \bar{X}\beta||_2^2$ subject to $\beta \geq 0$. This is simply the same optimisation problem, but constrained so that the solution is non-negative.
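A minimal sketch of that constrained fit, using `optim` with `method = "L-BFGS-B"` and a lower bound of zero on every coefficient. The data are simulated from the model stated earlier; the seed, sample size, `lambda = 2` and starting values are illustrative assumptions.

```r
set.seed(123)                                   # illustrative seed and sizes
N <- 100; p <- 2
X <- cbind(rnorm(N), rnorm(N))
y <- 2 * X[, 1] - X[, 2] + rnorm(N, sd = 0.2)
lambda <- 2                                     # hypothetical penalty

X_aug <- rbind(X, sqrt(lambda) * diag(p))       # augmented design matrix
y_aug <- c(y, rep(0, p))                        # augmented response

rss <- function(b) sum((y_aug - X_aug %*% b)^2)

# Same objective as unconstrained ridge, but with box constraints beta >= 0
fit <- optim(par = c(1, 1), fn = rss, method = "L-BFGS-B", lower = c(0, 0))
beta_nn <- fit$par
```

Because the true coefficient of the second predictor is negative, the constraint should pin its estimate at the bound, leaving a one-dimensional ridge fit for the first predictor.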

which shows that the original non-negative ridge regression task can be solved by reformulating it as a simple constrained optimisation problem. Some caveats:

- Non-normalisation of the intercept.
- `optim`'s L-BFGS-B argument. It is the most vanilla R solver that accepts bounds. I am sure that you will find dozens of better solvers.

Code for point 5: