Bayesian Regression – Is There a Bayesian Interpretation of Linear Regression with Simultaneous L1 and L2 Regularization (Elastic Net)?

bayesianelastic netpriorregressionregularization

It's well known that linear regression with an $l^2$ penalty is equivalent to finding the MAP estimate given a Gaussian prior on the coefficients. Similarly, using an $l^1$ penalty is equivalent to using a Laplace distribution as the prior.

It's not uncommon to use some weighted combination of $l^1$ and $l^2$ regularization. Can we say that this is equivalent to some prior distribution over the coefficients (intuitively, it seems that it must be)? Can we give this distribution a nice analytic form (maybe a mixture of Gaussian and Laplacian)? If not, why not?

Best Answer

Ben's comment is likely sufficient, but I provide some more references one of which is from before the paper Ben referenced.

A Bayesian elastic net representation was proposed by Kyung et. al. in their Section 3.1. Although the prior for the regression coefficient $\beta$ was correct, the authors incorrectly wrote down the mixture representation.

A corrected Bayesian model for the elastic net was recently proposed by Roy and Chakraborty (their Equation 6). The authors also go on to present an appropriate Gibbs sampler to sample from the posterior distribution, and show that the Gibbs sampler converges to the stationary distribution at a geometric rate. For this reason, these references might turn out to be useful, in addition to the Hans paper.

1. Slightly different definitions of $\lambda$

The optimization function used in the article is

$$S(\alpha,\beta;x,y) = \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$

The optimization function used by pythons sklearn has a factor $\frac{1}{2n}$ difference.

$$S(\alpha,\beta;x,y) = \frac{1}{2n}\sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$

So your code will works when you divide your reg_alpha by 2n.

See the manual

The optimization objective for Lasso is:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

2. Using the right $\sigma$

In addition, as Tim mentioned, you need to use the $\sigma$ that is the optimum in the Bayesian model.

With $f_\sigma(\sigma)$, $f_\alpha(\alpha)$, $f_\beta(\beta) = \frac{1}{2b} e^{-\frac{|\beta|}{b}}$ your prior functions, and $h(x,y)$ a normalization function, the logarithm of the posterior density is the following

$$\begin{array}{rcllllllll} S(\alpha,\beta,\sigma;x,y) &=& \rlap{\overbrace{\phantom{n -\frac{n}{2} \log 2\pi - n \log \sigma + \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 }}^{\text{likelihood}}} n \log \frac{1}{\sqrt{2\pi\sigma^2}} &+& \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 &+& \overbrace{ \log f_\beta(\beta_1) + \log f_\beta(\beta_1)}^{\text{prior of beta}} &+& \overbrace{ \log f_\sigma(\sigma) + \log f_\alpha(\alpha) }^{\text{prior other parameters}} &+& \overbrace{\log h(x,y)}^{\text{normalization function}}\\ &=& -\frac{n}{2} \log 2\pi - n \log \sigma &+& \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 &+& \log \frac{1}{2b} e^{-\frac{|\beta_1|}{b}} + \log \frac{1}{2b} e^{-\frac{|\beta_1|}{b}} &+&\log f_\sigma(\sigma) + \log f_\alpha(\alpha) &+& \log h(x,y) \\ &=& -\frac{n}{2} \log 2\pi - n \log \sigma &+& \rlap{\underbrace{\phantom{\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 - \sum_{p=1}^2 |\beta_i| }}_{\text{part that depends on the $\beta_i$}}} \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 &-& \frac{1}{}b\sum_{p=1}^2 |\beta_i| + \log \frac{1}{2b} &+&\log f_\sigma(\sigma) + \log f_\alpha(\alpha) &+& \log h(x,y) \end{array}$$

The part that depends on the $\beta$ is to be compared with the cost function.

$$\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 - \frac{1}{b} \sum_{p=1}^2 |\beta_i| $$

or multiplied by $\frac{\sigma^2}{n}$

$$\frac{1}{2n} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 - \frac{\sigma^2}{nb} \sum_{p=1}^2 |\beta_i| $$

this is the same as the first (l1 regularization) cost function if $$\lambda = \frac{\sigma^2}{nb}$$

Here $\sigma$ is the $\sigma$ that is part of the maximum a posteriori probability estimate.

3. Prior on $\alpha$

When you have a prior on $\alpha$ then you get a different estimate for this parameter. This will influence the estimates of the $\beta$ since the $X$ variables correlate with the intercept and they will correct for the the different $\alpha$.

4. Accuracy of the computations

Another source of variation between the two methods is that the algorithms might not converge accurately and due to early stopping you can get slightly different answers.

Code example

The code below combines the four points above

import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso

np.random.seed(1)

k = 2
n = 150

true_sigma = 1
true_alpha = 5

coefs = np.random.rand(k)*3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

basic_model = pm.Model()
b = 1

with basic_model:
    alpha = pm.Flat("alpha")                       # different prior
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")                       # different prior

    mu = alpha + (beta * X).sum(axis=1)

    Y_obs = pm.Normal("Y_obs", mu=mu, sd=sigma, observed=y)

    step = pm.NUTS()
    pm.sample(2000, tune = 1000, step = step)      # different tolerance to have the algorithm converge further
    
map_estimate = pm.find_MAP(model=basic_model,  maxeval=10**6)
#print(map_estimate['sigma'])

# now using Lasso
reg_alpha = map_estimate['sigma']**2 / b / n                 # different alpha
lr = Lasso(alpha=reg_alpha, tol = 1e-6, max_iter = 10^5)     # different tolerance to have the algorithm converge further
lr.fit(X, y)
print("\n")
print(lr.coef_)
print(map_estimate['beta'])

output

[1.15586011 2.45047217]
[1.15586059 2.45047129]

Why is l1 regularization so much faster than find_MAP?

Your question seems to be motivated by the use of Lasso as a faster alternative.

So given a bayesian regression with a Laplace prior with scale b it is possible to use Lasso and get similar results, at least 100x faster.

But the Pymc3 library is an overkill for this problem and that is why it works so relatively slow. The find_MAP function is computing the posterior function by means of (slow) sampling.

This is unnecessary for this problem where an expression for the posterior is already known.
Pymc3 and sampling with a monte carlo method is useful when the expression for the posterior is too complex to be computed by hand or with a computational method or algebraic software program.

For instance, the posterior could involve a step like integration to marginalize out nuisance parameters. This is also already the case here when we look for the posterior of the $\beta$ only. E.g. when we look to maximize the marginal distribution of $\beta$

$$S(\beta;x,y) = \iint S(\alpha,\beta,\sigma;x,y) d\alpha d\sigma$$

^{But to be honest, I am not sure whether find_MAP is optimizing to find the maximum of the joint distribution or to find the individual maxima of the marginal distributions.}

Best Answer

Related Solutions

Solved – How to (L1 / L2) regularization be equivalent to using a prior when priors can’t be changed

Regularization – Relationship Between Laplace and L1 Regularization