Regularization – Relationship Between Laplace and L1 Regularization

bayesian, laplace-distribution, lasso, pymc, regularization

It is well known that L1-regularized linear regression is equivalent to the MAP estimate of a regression with a Laplace prior on the coefficients. This is explained here:

https://bjlkeng.github.io/posts/probabilistic-interpretation-of-regularization/

I would love to make use of this fact and convert a Bayesian model I have with a Laplace prior into a simple Lasso regression using sklearn, which is much faster. However, when I try to follow the formula that converts between the b of the Laplace prior and the alpha of the L1 penalty, I do not get the expected results. According to the article above, the conversion from the scale parameter b of the Laplace prior to the alpha of Lasso should be 2*sig^2 / b.

Using pymc3 to implement the bayesian model with a Laplace prior:

import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso


k = 2
n = 150

true_sigma = 1
true_alpha = 5

coefs = np.random.rand(k)*3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

basic_model = pm.Model()
b = 3

with basic_model:
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfNormal("sigma", sigma=3)

    mu = alpha + (beta * X).sum(axis=1)

    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=y)

map_estimate = pm.find_MAP(model=basic_model)

# now using Lasso
reg_alpha = 2*true_sigma**2 / b
lr = Lasso(alpha=reg_alpha)
lr.fit(X, y)


We can now compare lr.coef_ and map_estimate["beta"].
I played with the data drawn, sigma, and b, and found that the results are rarely the same. If I manually tune reg_alpha, I can find a value that produces similar coefficients, but I cannot find a consistent formula.
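
A quick way to run that comparison (a small sketch reusing the objects defined above; the tolerance is arbitrary):

print("Lasso coefficients:", lr.coef_)
print("MAP estimate:      ", map_estimate["beta"])
print("close?", np.allclose(lr.coef_, map_estimate["beta"], atol=1e-2))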

Even if this were solved, the theoretical formula involves knowing the true sigma (the noise scale), which we obviously cannot know. Is there no way to convert the Bayesian model with a given b to an equivalent Lasso with the correct alpha?

Edit:

We found a solution.
First, there were a few inaccuracies in the pymc3 model that made it not exactly equivalent to Lasso. They don't make much of a difference, but the correct model would be:

with basic_model:
    alpha = pm.Flat("alpha")
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")

    mu = (beta * X).sum(axis=1)  + alpha
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=y)

map_estimate = pm.find_MAP(model=basic_model)

We can then estimate the MAP using Lasso:

reg_alpha = map_estimate['sigma']**2 / (b * n)
lr = Lasso(alpha=reg_alpha, tol = 1e-6, fit_intercept=True)
lr.fit(X, y)

but this requires an estimate of the model's true error, taken here from map_estimate['sigma'].
However, we can iteratively re-estimate the error using this code:

tolerance = 1e-6
max_iter = 10

est_sigma = 1
for i in range(max_iter):
    reg_alpha = est_sigma**2 / (b * n)
    lr = Lasso(alpha=reg_alpha, fit_intercept=True)
    lr.fit(X, y)
    new_est_sigma = (lr.predict(X) - y).std()
    if abs(new_est_sigma - est_sigma) < tolerance:
        break
    est_sigma = new_est_sigma
    print(lr.coef_, est_sigma)

which runs much faster than pymc3's code. So, given a Bayesian regression with a Laplace prior with scale b, it is possible to use Lasso and get similar results, at least 100x faster.
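
To check the speed difference on a specific machine, here is a rough timing sketch (reusing basic_model, X, y, b and n from above; the numbers vary by machine, and pymc3's model compilation adds one-off overhead):

import time

t0 = time.perf_counter()
map_estimate = pm.find_MAP(model=basic_model)
t_map = time.perf_counter() - t0

t0 = time.perf_counter()
Lasso(alpha=map_estimate["sigma"]**2 / (b * n), fit_intercept=True).fit(X, y)
t_lasso = time.perf_counter() - t0

print(f"find_MAP: {t_map:.3f}s, Lasso: {t_lasso:.4f}s")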

Best Answer

There are four points to address in order to make the L1-regularized estimate and the Bayesian MAP estimate equivalent.

1. Slightly different definitions of $\lambda$

The optimization function used in the article is

$$S(\alpha,\beta;x,y) = \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$

The optimization function used by Python's sklearn differs by a factor of $\frac{1}{2n}$ on the squared-error term.

$$S(\alpha,\beta;x,y) = \frac{1}{2n}\sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$

So your code works when you divide your reg_alpha by $2n$.

See the manual

The optimization objective for Lasso is:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
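
The scaling can be checked numerically. The sketch below (not part of the original post; all names are illustrative) fits sklearn's Lasso with alpha = lambda / (2 n) and compares it against a direct minimization of the article's cost with penalty lambda:

import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_check, k_check = 150, 2
Xc = rng.random((n_check, k_check))
yc = Xc @ np.array([1.0, 2.5]) + 5 + rng.normal(0, 1, n_check)

lam = 3.0                                           # penalty in the article's parameterization
lr = Lasso(alpha=lam / (2 * n_check), fit_intercept=True,
           tol=1e-10, max_iter=100_000).fit(Xc, yc)

def article_cost(params):
    # article's objective: squared error plus lambda times the l1 norm of the slopes
    a, w = params[0], params[1:]
    return np.sum((yc - (a + Xc @ w)) ** 2) + lam * np.sum(np.abs(w))

res = minimize(article_cost, np.zeros(k_check + 1), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 20_000})
print(lr.coef_)        # coefficients from sklearn
print(res.x[1:])       # coefficients from direct minimization; should match closely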

2. Using the right $\sigma$

In addition, as Tim mentioned, you need to use the $\sigma$ that is the optimum in the Bayesian model.

With $f_\sigma(\sigma)$, $f_\alpha(\alpha)$, and $f_\beta(\beta) = \frac{1}{2b} e^{-\frac{|\beta|}{b}}$ as your prior densities, and $h(x,y)$ a normalization function, the logarithm of the posterior density is the following:

$$\begin{aligned} S(\alpha,\beta,\sigma;x,y) =\ & \overbrace{n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\right)^2}^{\text{likelihood}} + \overbrace{\log f_\beta(\beta_1) + \log f_\beta(\beta_2)}^{\text{prior of beta}} + \overbrace{\log f_\sigma(\sigma) + \log f_\alpha(\alpha)}^{\text{prior of other parameters}} + \overbrace{\log h(x,y)}^{\text{normalization}}\\ =\ & -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\right)^2 + \log \frac{1}{2b} e^{-\frac{|\beta_1|}{b}} + \log \frac{1}{2b} e^{-\frac{|\beta_2|}{b}} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y) \\ =\ & -\frac{n}{2} \log 2\pi - n \log \sigma \underbrace{- \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\right)^2 - \frac{1}{b}\sum_{p=1}^2 \vert\beta_p\vert}_{\text{part that depends on the }\beta_p} + 2\log \frac{1}{2b} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y) \end{aligned}$$

The part that depends on the $\beta_p$ is to be compared with the Lasso cost function.

$$-\frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\right)^2 - \frac{1}{b} \sum_{p=1}^2 \vert\beta_p\vert $$

or, multiplied by $-\frac{\sigma^2}{n}$ (so that maximizing the log posterior becomes minimizing the expression),

$$\frac{1}{2n} \sum_{i=1}^n \left(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\right)^2 + \frac{\sigma^2}{nb} \sum_{p=1}^2 \vert\beta_p\vert $$

This is the same as sklearn's L1-regularization cost function if $$\lambda = \frac{\sigma^2}{nb}$$

Here $\sigma$ is the $\sigma$ that is part of the maximum a posteriori probability estimate.
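
In code, the conversion is a one-liner (a small helper sketch; the function name is made up):

def lasso_alpha_from_laplace_b(b, sigma_map, n):
    """Map the Laplace prior scale b to sklearn's Lasso alpha,
    given the MAP estimate of the noise scale sigma and the sample size n."""
    return sigma_map ** 2 / (n * b)

# usage with the objects from the code above:
# reg_alpha = lasso_alpha_from_laplace_b(b, map_estimate["sigma"], n)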

3. Prior on $\alpha$

When you place a prior on $\alpha$, you get a different estimate for this parameter. This will influence the estimates of the $\beta$, since the $X$ variables correlate with the intercept and the $\beta$ will adjust to compensate for the different $\alpha$.

4. Accuracy of the computations

Another source of variation between the two methods is that the algorithms might not converge to full accuracy, and due to early stopping you can get slightly different answers.
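
A quick way to check whether early stopping matters in a given setting is to fit the same Lasso twice with different convergence settings (a sketch reusing X and y from above; the penalty value is arbitrary):

# any gap between the two coefficient vectors comes from early stopping, not from the model
loose = Lasso(alpha=0.01, tol=1e-2, max_iter=50).fit(X, y)
tight = Lasso(alpha=0.01, tol=1e-10, max_iter=100_000).fit(X, y)
print(loose.coef_)
print(tight.coef_)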

Code example

The code below combines the four points above

import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso

np.random.seed(1)

k = 2
n = 150

true_sigma = 1
true_alpha = 5

coefs = np.random.rand(k)*3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

basic_model = pm.Model()
b = 1

with basic_model:
    alpha = pm.Flat("alpha")                       # different prior
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")                       # different prior

    mu = alpha + (beta * X).sum(axis=1)

    Y_obs = pm.Normal("Y_obs", mu=mu, sd=sigma, observed=y)

    step = pm.NUTS()
    pm.sample(2000, tune=1000, step=step)          # draw posterior samples (not needed for the MAP estimate below)
    
map_estimate = pm.find_MAP(model=basic_model,  maxeval=10**6)
#print(map_estimate['sigma'])

# now using Lasso
reg_alpha = map_estimate['sigma']**2 / b / n                 # different alpha
lr = Lasso(alpha=reg_alpha, tol=1e-6, max_iter=10**5)     # different tolerance to have the algorithm converge further
lr.fit(X, y)
print("\n")
print(lr.coef_)
print(map_estimate['beta'])

output

[1.15586011 2.45047217]
[1.15586059 2.45047129]

Why is l1 regularization so much faster than find_MAP?

Your question seems to be motivated by the use of Lasso as a faster alternative.

So given a bayesian regression with a Laplace prior with scale b it is possible to use Lasso and get similar results, at least 100x faster.

But the pymc3 library is overkill for this problem, and that is why it is relatively slow. Most of the time goes into building and compiling the model and into the (slow) MCMC sampling in the code above, rather than into the optimization that find_MAP performs.

  • This is unnecessary for this problem, where an expression for the posterior is already known and can be optimized directly (see the sketch after this list).
  • Pymc3 and Monte Carlo sampling are useful when the expression for the posterior is too complex to be computed by hand or with a computational method or algebraic software program.
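
For this model the log posterior is available in closed form, so it can be optimized directly, for example with scipy, without building a pymc3 model at all. A minimal sketch, assuming flat priors on $\alpha$ and $\sigma$ and reusing X, y and b from the code above (the function name is made up):

import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(params, X, y, b):
    # parameters: intercept, log(sigma) (to keep sigma positive), then the betas
    a, log_sigma, w = params[0], params[1], params[2:]
    sigma = np.exp(log_sigma)
    resid = y - (a + X @ w)
    log_lik = -len(y) * np.log(sigma) - np.sum(resid ** 2) / (2 * sigma ** 2)
    log_prior = -np.sum(np.abs(w)) / b              # Laplace(0, b) prior, up to constants
    return -(log_lik + log_prior)

start = np.zeros(X.shape[1] + 2)
res = minimize(neg_log_posterior, start, args=(X, y, b), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 50_000, "maxfev": 50_000})
print("beta (direct MAP):", res.x[2:])              # should be close to the Lasso / find_MAP results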

For instance, the posterior could involve an integration step to marginalize out nuisance parameters. This is already the case here when we look for the posterior of the $\beta$ only, e.g. when we want to maximize the marginal distribution of $\beta$:

$$p(\beta;x,y) \propto \iint e^{S(\alpha,\beta,\sigma;x,y)} \,d\alpha \,d\sigma$$

But to be honest, I am not sure whether find_MAP is optimizing to find the maximum of the joint distribution or to find the individual maxima of the marginal distributions.