Bayesian – Determining the Prior for the Non-negative LASSO, Given That LASSO Is Equivalent to Bayesian Regression with a Laplace Prior

bayesian, exponential-distribution, lasso, prior, regularization

We know that the LASSO penalty is equivalent to a Laplace prior. So what would be the corresponding prior for the non-negative LASSO? Is it the exponential distribution?
More generally, is it true that every constraint can be translated into a prior distribution under the Bayesian framework? If so, given constraints such as sparsity, non-negativity, smoothness, etc., what is the strategy for finding the corresponding Bayesian prior?
I know this might be too general, but I believe I am not the first to consider this question. Any suggestions/references/papers would be much appreciated!

Best Answer

We can adapt the derivation from Why is Lasso penalty equivalent to the double exponential (Laplace) prior?

We minimize the loss

$$L(\beta \vert X, y) = \sum_{i=1}^n (y_i-f(\beta,X_i))^2 + \lambda \sum_{i=1}^p \vert \beta_i \vert $$

Minimizing this loss is the same as maximizing, for any fixed $\sigma^2 > 0$,

$$e^{-\frac{1}{2\sigma^2} L(\beta \vert X, y)} = e^{- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-f(\beta,X_i))^2} \cdot e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert } $$

Apart from a scaling constant (to normalize the function), this can be seen as a posterior distribution: the product of a Gaussian likelihood and a prior distribution for the coefficients that is proportional to

$$f_{\text{prior}} \propto e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert }$$
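
As a quick numerical check, here is a minimal sketch (assuming a linear model $f(\beta, X_i) = X_i^\top \beta$ and simulated data, both of which are my choices, not part of the question): the lasso minimizer and the MAP estimate under the Laplace-form prior coincide, because the two objectives differ only by the positive factor $\frac{1}{2\sigma^2}$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 50, 3, 5.0, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

def lasso_loss(beta):
    # L(beta | X, y) = sum of squared residuals + lambda * sum |beta_i|
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def neg_log_posterior(beta):
    # -log[ Gaussian likelihood * Laplace prior with scale 2*sigma2/lambda ],
    # up to additive constants; this is lasso_loss / (2 * sigma2)
    return (np.sum((y - X @ beta) ** 2) / (2 * sigma2)
            + lam * np.sum(np.abs(beta)) / (2 * sigma2))

b_lasso = minimize(lasso_loss, np.zeros(p), method="Nelder-Mead").x
b_map = minimize(neg_log_posterior, np.zeros(p), method="Nelder-Mead").x
print(np.round(b_lasso, 3), np.round(b_map, 3))  # agree up to optimizer tolerance
```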


With the non-negative lasso this cost function remains the same, except that we do not allow negative coefficients $\beta_i$.

So we can keep the same Laplace form of the prior, and only need to change it where some $\beta_i$ is negative:

$$f_{\text{prior}} \propto \begin{cases} e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert } & \quad \text{if $\forall i: \beta_i \geq 0$} \\ 0 & \quad \text{if $\exists i: \beta_i < 0$} \\ \end{cases}$$

which is indeed the exponential distribution as a prior: each $\beta_i$ independently follows an exponential distribution with rate $\frac{\lambda}{2\sigma^2}$.
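
Again a small sketch under the same assumptions as above (a linear model and simulated data of my choosing): the non-negative lasso solution matches the MAP estimate under independent exponential priors, whose support $\beta_i \geq 0$ encodes the constraint.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import expon

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 50, 3, 5.0, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, 1.0]) + rng.normal(size=n)

def nn_lasso_loss(beta):
    # for beta >= 0 the penalty lambda * sum |beta_i| is just lambda * sum(beta)
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta)

def neg_log_posterior(beta):
    # independent exponential priors with rate lambda / (2 * sigma2),
    # i.e. scipy scale = 2 * sigma2 / lam; their support is beta_i >= 0
    log_prior = expon.logpdf(beta, scale=2 * sigma2 / lam).sum()
    log_lik = -np.sum((y - X @ beta) ** 2) / (2 * sigma2)
    return -(log_lik + log_prior)

bounds = [(0, None)] * p  # the prior's support enforces non-negativity
b_nn = minimize(nn_lasso_loss, np.ones(p), bounds=bounds).x
b_map = minimize(neg_log_posterior, np.ones(p), bounds=bounds).x
print(np.round(b_nn, 3), np.round(b_map, 3))  # the two estimates agree
```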


Constraints on smoothness will likewise end up in the prior distribution. For instance, if you have some extra penalty term that is a function of $\beta$,

$$L(\beta \vert X, y) = \sum_{i=1}^n (y_i-f(\beta,X_i))^2 + \lambda_1 \sum_{i=1}^p \vert \beta_i \vert + \lambda_2 g(\beta)$$

then this will end up as an exponential term in the prior distribution

$$f_{\text{prior}} \propto e^{-\frac{\lambda_1}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert -\frac{\lambda_2}{2\sigma^2} g(\beta) }$$

In this case you may not get a prior distribution as simple as $$e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert} = e^{-\frac{\lambda}{2\sigma^2} \vert \beta_1 \vert} \cdot e^{-\frac{\lambda}{2\sigma^2} \vert \beta_2 \vert} \cdot e^{-\frac{\lambda}{2\sigma^2} \vert \beta_3 \vert} \cdots,$$ which factorizes into independent distributions for the individual $\beta_i$.

For instance, for the fused lasso, where the absolute differences of neighbouring coefficients $\vert \beta_{i+1} - \beta_i \vert$ are included in the penalty, you would get something like the following (for just two coefficients):

$$f_{\text{prior}} \propto e^{-\frac{1}{2\sigma^2} \left( \lambda_1 \vert \beta_1 \vert +\lambda_1 \vert \beta_2 \vert + \lambda_2 \vert \beta_2 - \beta_1 \vert \right) }$$
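
To illustrate that this prior does not factorize, here is a small numerical sketch (folding the $\frac{1}{2\sigma^2}$ factor into the $\lambda$'s and setting $\lambda_1 = \lambda_2 = 1$, values I picked arbitrarily): the normalized joint density differs visibly from the product of its marginals, whereas for the plain lasso prior the discrepancy would be numerically zero.

```python
import numpy as np

lam1, lam2 = 1.0, 1.0
grid = np.linspace(-10, 10, 801)
d = grid[1] - grid[0]
B1, B2 = np.meshgrid(grid, grid, indexing="ij")

# unnormalized two-coefficient fused-lasso prior, normalized on the grid
joint = np.exp(-(lam1 * np.abs(B1) + lam1 * np.abs(B2) + lam2 * np.abs(B2 - B1)))
joint /= joint.sum() * d * d

m1 = joint.sum(axis=1) * d  # marginal density of beta_1
m2 = joint.sum(axis=0) * d  # marginal density of beta_2

# for a factorizing prior this discrepancy would be (numerically) zero
print(np.max(np.abs(joint - np.outer(m1, m2))))
```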