The short answer is that it's up to you, depending on your goal. In the past I have used AIC for the lasso.
However, it sounds like you are using this model for prediction, so the misclassification rate is a reasonable criterion. Note that misclassification can be measured in several ways. Are you interested in the absolute percentage classified correctly? Or do you only care about how many of those classified as 1 (or yes, etc.) were classified correctly? I would do some reading on positive predictive values, negative predictive values, etc.:
https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values
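As a small sketch of the distinction, the positive and negative predictive values can be computed directly from confusion-matrix counts (the counts below are made-up illustration values):

```python
def ppv_npv(tp, fp, tn, fn):
    """Positive predictive value: of those predicted positive, the fraction truly positive.
    Negative predictive value: of those predicted negative, the fraction truly negative."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Hypothetical confusion-matrix counts
ppv, npv = ppv_npv(tp=40, fp=10, tn=45, fn=5)
print(ppv)  # 0.8 -- of 50 predicted positives, 40 were correct
print(npv)  # 0.9 -- of 50 predicted negatives, 45 were correct
```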
In addition, when doing your cross-validation there is a plethora of criteria you could use to validate your model. A short list of other common criteria:
- $R^2$
- $MSE$
- Mallows's $C_p$
- $AIC$
Look them up and see which is most relevant to you!
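To make two of these concrete, here is a minimal sketch of $MSE$ and $R^2$ computed on hypothetical predictions (the data below are made up for illustration):

```python
def mse(y, yhat):
    # mean squared error: average squared residual
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    # R^2: 1 - (residual sum of squares) / (total sum of squares)
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical held-out targets and predictions
y = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.8]
print(mse(y, yhat))        # 0.025
print(r_squared(y, yhat))  # 0.98
```

In a cross-validation loop you would average such a criterion over the held-out folds and pick the tuning parameter that optimizes it.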
We can copy the situation from the question "Why is the Lasso penalty equivalent to the double exponential (Laplace) prior?"
We minimize the loss
$$L(\beta \vert X, y) = \sum_{i=1}^n (y_i-f(\beta,X_i))^2 + \lambda \sum_{i=1}^p \vert \beta_i \vert $$
which is like maximizing
$$e^{-L(\beta \vert X, y)} = e^{- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-f(\beta,X_i))^2} \cdot e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert } $$
Apart from a scaling constant (to normalize the function), this can be seen as the posterior distribution obtained as the product of a Gaussian likelihood and a prior distribution for the coefficients proportional to
$$f_{\text{prior}} \propto e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert }$$
With the non-negative lasso the cost function remains the same, except that negative coefficients $\beta_i$ are not allowed.
So we can keep the same Laplace form of the prior and only need to change it where some $\beta_i$ is negative:
$$f_{\text{prior}} \propto \begin{cases}
e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert } & \quad \text{if $\forall i: \beta_i \geq 0$} \\
0 & \quad \text{if $\exists i: \beta_i < 0$} \\
\end{cases}$$
This is indeed just the exponential distribution as prior (for $\beta_i \geq 0$ we have $\vert \beta_i \vert = \beta_i$).
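As a numerical sanity check, here is a sketch (with made-up toy data) showing that the negative log of this truncated posterior differs from the non-negative-lasso objective only by the positive factor $1/(2\sigma^2)$, so the two are minimized by the same $\beta$:

```python
import math

def lasso_loss(beta, X, y, lam):
    # non-negative-lasso objective: RSS + lambda * sum(beta_i), with beta_i >= 0 enforced
    if any(b < 0 for b in beta):
        return math.inf  # constraint violated
    rss = sum((yi - sum(b * xij for b, xij in zip(beta, xi))) ** 2
              for xi, yi in zip(X, y))
    return rss + lam * sum(beta)  # |beta_i| = beta_i since beta_i >= 0

def neg_log_posterior(beta, X, y, lam, sigma2):
    # -log of (Gaussian likelihood * truncated exponential prior),
    # dropping beta-independent normalizing constants
    if any(b < 0 for b in beta):
        return math.inf  # prior density is zero for negative coefficients
    return lasso_loss(beta, X, y, lam) / (2 * sigma2)

# Hypothetical data: two observations, two coefficients
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
lam, sigma2 = 0.5, 1.5

b1, b2 = [0.5, 1.0], [1.0, 2.0]
d_loss = lasso_loss(b1, X, y, lam) - lasso_loss(b2, X, y, lam)
d_post = (neg_log_posterior(b1, X, y, lam, sigma2)
          - neg_log_posterior(b2, X, y, lam, sigma2))
# d_post == d_loss / (2 * sigma2): the objectives differ only by a positive scale,
# so the MAP estimate coincides with the non-negative-lasso solution
```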
Constraints on smoothness will also end up in the prior distribution. For instance, if you add some extra penalty term that is a function of $\beta$,
$$L(\beta \vert X, y) = \sum_{i=1}^n (y_i-f(\beta,X_i))^2 + \lambda_1 \sum_{i=1}^p \vert \beta_i \vert + \lambda_2 g(\beta)$$
then this will end up as an exponential term in the prior distribution
$$f_{\text{prior}} \propto e^{-\frac{\lambda_1}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert -\frac{\lambda_2}{2\sigma^2} g(\beta) }$$
In this case you may not so easily get a simple prior distribution like $$e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert} =e^{-\frac{\lambda}{2\sigma^2} \vert \beta_1 \vert} \cdot e^{-\frac{\lambda}{2\sigma^2} \vert \beta_2 \vert} \cdot e^{-\frac{\lambda}{2\sigma^2} \vert \beta_3 \vert} \cdot \dots $$ which factorizes into independent distributions for the individual $\beta_i$.
For instance, for the fused lasso, where the differences of neighbouring coefficients $\vert \beta_{i+1} -\beta_{i} \vert$ are included in the penalty, you would get something like the following (for only two coefficients):
$$f_{\text{prior}} \propto e^{-\frac{1}{2\sigma^2} \left( \lambda_1 \vert \beta_1 \vert +\lambda_1 \vert \beta_2 \vert + \lambda_2 \vert \beta_2 - \beta_1 \vert \right) }$$
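A minimal sketch of the fused-lasso penalty that appears in the exponent above (the coefficient values and $\lambda$'s are made-up illustrations):

```python
def fused_lasso_penalty(beta, lam1, lam2):
    # lam1 * sum |beta_i|  (sparsity term)
    # + lam2 * sum |beta_{i+1} - beta_i|  (fusion/smoothness term over neighbours)
    sparsity = lam1 * sum(abs(b) for b in beta)
    fusion = lam2 * sum(abs(b2 - b1) for b1, b2 in zip(beta, beta[1:]))
    return sparsity + fusion

print(fused_lasso_penalty([1.0, 3.0], 0.5, 2.0))  # 0.5*4 + 2.0*2 = 6.0
print(fused_lasso_penalty([2.0, 2.0], 1.0, 5.0))  # equal neighbours, no fusion cost: 4.0
```

Because the fusion term couples neighbouring coefficients, the corresponding prior no longer factorizes into independent pieces for each $\beta_i$.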
The L2 penalty (ridge) penalizes the sum of squared betas directly, not via a constraint such as $\sum_j \beta_j^2 < C$. The L1 penalty gives the lasso. For the Bayesian lasso see the 2008 JASA paper by Trevor Park and George Casella.