There are four points to address in order to make the l1 regularization and the Bayesian MAP estimate equivalent.
1. Slightly different definitions of $\lambda$
The optimization function used in the article is
$$S(\alpha,\beta;x,y) = \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$
The optimization function used by Python's sklearn differs by a factor $\frac{1}{2n}$ on the squared-error term.
$$S(\alpha,\beta;x,y) = \frac{1}{2n}\sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$
So your code will work when you divide your reg_alpha by $2n$.
See the manual
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
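As a rough check of this scaling (a sketch of my own, not from the article; the data and the value of sk_alpha below are made up), minimizing the unscaled objective with $\lambda = 2n \cdot \texttt{alpha}$ should reproduce sklearn's Lasso coefficients:
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso
rng = np.random.default_rng(0)
n = 50
X = rng.random((n, 2))
y = X @ np.array([1.5, -2.0]) + 0.3 + 0.1 * rng.standard_normal(n)
sk_alpha = 0.05                  # sklearn's regularization strength (arbitrary)
lam = 2 * n * sk_alpha           # lambda in the unscaled objective
def unscaled_objective(params):
    a, b1, b2 = params
    resid = y - (a + b1 * X[:, 0] + b2 * X[:, 1])
    return np.sum(resid**2) + lam * (abs(b1) + abs(b2))
res = minimize(unscaled_objective, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 10**5})
lr = Lasso(alpha=sk_alpha, tol=1e-10, max_iter=10**6).fit(X, y)
print(res.x[1:])   # betas from the unscaled objective
print(lr.coef_)    # betas from sklearn; the two should agree closely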
2. Using the right $\sigma$
In addition, as Tim mentioned, you need to use the $\sigma$ that is the optimum in the Bayesian model.
With $f_\sigma(\sigma)$, $f_\alpha(\alpha)$, and $f_\beta(\beta) = \frac{1}{2b} e^{-\frac{|\beta|}{b}}$ as your prior densities, and $h(x,y)$ a normalization function, the logarithm of the posterior density is the following:
$$\begin{array}{rcl}
S(\alpha,\beta,\sigma;x,y) &=& \overbrace{n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2}^{\text{likelihood}} + \overbrace{\log f_\beta(\beta_1) + \log f_\beta(\beta_2)}^{\text{prior of beta}} + \overbrace{\log f_\sigma(\sigma) + \log f_\alpha(\alpha)}^{\text{prior other parameters}} + \overbrace{\log h(x,y)}^{\text{normalization function}}\\
&=& -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \log \frac{1}{2b} e^{-\frac{|\beta_1|}{b}} + \log \frac{1}{2b} e^{-\frac{|\beta_2|}{b}} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y) \\
&=& -\frac{n}{2} \log 2\pi - n \log \sigma \underbrace{- \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 - \frac{1}{b} \sum_{p=1}^2 |\beta_p|}_{\text{part that depends on the } \beta_p} + 2 \log \frac{1}{2b} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y)
\end{array}$$
The part that depends on the $\beta$'s is to be compared with the cost function. Flipping its sign (maximizing the log posterior is minimizing its negative) gives
$$\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \frac{1}{b} \sum_{p=1}^2 |\beta_p| $$
or multiplied by $\frac{\sigma^2}{n}$
$$\frac{1}{2n} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \frac{\sigma^2}{nb} \sum_{p=1}^2 |\beta_p| $$
this is the same as the sklearn (l1 regularization) cost function above if $$\lambda = \frac{\sigma^2}{nb}$$
Here $\sigma$ is the $\sigma$ that is part of the maximum a posteriori probability estimate.
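For illustration only (these numbers are hypothetical, not the output of the code below): if the MAP estimate were $\sigma \approx 0.6$ with $n = 150$ and $b = 1$, this would give $\lambda = 0.6^2/(150 \cdot 1) \approx 0.0024$, which is the value you would pass as sklearn's alpha.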
3. Prior on $\alpha$
When you have a prior on $\alpha$ you get a different estimate for this parameter. This will also influence the estimates of the $\beta$, since the $X$ variables correlate with the intercept and the $\beta$'s will partly correct for the different $\alpha$.
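For instance, with a flat prior on $\alpha$ (as in the code example below) the MAP intercept given the $\beta$'s is the ordinary unpenalized one,
$$\hat\alpha = \bar y - \beta_1 \bar x_1 - \beta_2 \bar x_2,$$
which matches how sklearn fits the intercept; a proper prior (say a normal centered at 0) would shrink $\hat\alpha$, and the $\beta$ estimates would shift to compensate.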
4. Accuracy of the computations
Another source of variation between the two methods is that the algorithms might not converge to full accuracy; due to early stopping you can get slightly different answers.
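A quick way to see whether the Lasso solver really converged (a small sketch; X, y and the value of alpha here stand in for the ones in the code example below) is to tighten the tolerance and check the number of iterations it actually used:
from sklearn.linear_model import Lasso
lr = Lasso(alpha=0.01, tol=1e-6, max_iter=10**5).fit(X, y)
print(lr.n_iter_)   # far below max_iter suggests convergence; sklearn also
                    # emits a ConvergenceWarning when it runs out of iterations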
Code example
The code below combines the four points above
import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso

np.random.seed(1)

# generate data
k = 2
n = 150
true_sigma = 1
true_alpha = 5
coefs = np.random.rand(k) * 3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

# Bayesian model with a Laplace prior on the coefficients
basic_model = pm.Model()
b = 1
with basic_model:
    alpha = pm.Flat("alpha")                       # different prior (point 3)
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")                   # different prior (point 3)
    mu = alpha + (beta * X).sum(axis=1)
    Y_obs = pm.Normal("Y_obs", mu=mu, sd=sigma, observed=y)
    step = pm.NUTS()
    pm.sample(2000, tune=1000, step=step)

map_estimate = pm.find_MAP(model=basic_model, maxeval=10**6)
# print(map_estimate['sigma'])

# now using Lasso
reg_alpha = map_estimate['sigma']**2 / b / n       # different alpha (points 1 and 2)
lr = Lasso(alpha=reg_alpha, tol=1e-6, max_iter=10**5)  # tighter tolerance to have the algorithm converge further (point 4)
lr.fit(X, y)

print("\n")
print(lr.coef_)
print(map_estimate['beta'])
output
[1.15586011 2.45047217]
[1.15586059 2.45047129]
Why is l1 regularization so much faster than find_MAP?
Your question seems to be motivated by the use of Lasso as a faster alternative. Given a Bayesian regression with a Laplace prior with scale $b$, it is indeed possible to use Lasso and get similar results, at least 100x faster.
But the PyMC3 library is overkill for this problem, and that is why it is comparatively slow: the posterior is built as a generic probabilistic model, and the sampling step in the code above is especially expensive.
- This is unnecessary for this problem, where an expression for the posterior is already known.
- PyMC3 and sampling with a Monte Carlo method are useful when the expression for the posterior is too complex to be computed by hand or with a computational method or computer algebra program.
For instance, the posterior could involve a step like integration to marginalize out nuisance parameters. That is actually already the case here if we look for the posterior of the $\beta$'s only, e.g. when we wish to maximize the marginal distribution of $\beta$
$$p(\beta \mid x,y) = \iint e^{S(\alpha,\beta,\sigma;x,y)} \, d\alpha \, d\sigma$$
But to be honest, I am not sure whether find_MAP is optimizing to find the maximum of the joint distribution or to find the individual maxima of the marginal distributions.
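To make the first bullet point above concrete, here is a minimal sketch (my own, not part of the original answer) that maximizes the known log posterior directly with a generic optimizer, assuming flat priors for $\alpha$ and $\sigma$ and a Laplace$(0,b)$ prior for the $\beta$'s. Up to details of how PyMC3 handles transformed variables, the result should be close to find_MAP and to Lasso, and it runs much faster:
import numpy as np
from scipy.optimize import minimize
def neg_log_posterior(params, X, y, b=1.0):
    # params = (alpha, beta_1, ..., beta_k, log sigma); flat priors on alpha and sigma
    alpha, beta, log_sigma = params[0], params[1:-1], params[-1]
    sigma = np.exp(log_sigma)
    resid = y - alpha - X @ beta
    neg_loglik = len(y) * np.log(sigma) + np.sum(resid**2) / (2 * sigma**2)
    return neg_loglik + np.sum(np.abs(beta)) / b   # Laplace prior on the betas
# usage, with X, y, k from the code example above:
# start = np.zeros(k + 2)
# res = minimize(neg_log_posterior, start, args=(X, y), method="Nelder-Mead",
#                options={"maxiter": 10**5, "xatol": 1e-8, "fatol": 1e-8})
# print(res.x[1:-1])   # the beta estimates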
Best Answer
As all the comments above note, the Bayesian interpretation of LASSO is not taking the expected value of the posterior distribution, which is what you would want to do if you were a purist. If that were the case, then you would be right that there is only a very small chance that the estimate would be exactly zero given the data.
In reality, the Bayesian interpretation of LASSO takes the MAP (Maximum A Posteriori) estimator of the posterior. It sounds like you are familiar with this, but for anyone who is not: it is basically Bayesian maximum likelihood, where you use the value that corresponds to the maximum probability of occurrence (the mode) as your estimator for the parameters in LASSO. Since the prior density increases exponentially toward zero from the negative direction and falls off exponentially in the positive direction, unless your data strongly suggest the beta is some other significant value, the maximum of your posterior is likely to be exactly 0.
Long story short, your intuition seems to be based on the mean of the posterior, but the Bayesian interpretation of LASSO is based on taking the mode of the posterior.
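A small numerical sketch of this point (the numbers here are made up): with a single coefficient, a normal likelihood, and a Laplace prior, the posterior mode can sit exactly at zero while the posterior mean does not:
import numpy as np
beta_hat, se, b = 0.3, 0.5, 0.5          # hypothetical estimate, standard error, prior scale
grid = np.linspace(-5, 5, 200001)
dx = grid[1] - grid[0]
log_post = -(grid - beta_hat)**2 / (2 * se**2) - np.abs(grid) / b
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx                  # normalize the density on the grid
print("posterior mode:", grid[np.argmax(post)])       # (numerically) zero
print("posterior mean:", np.sum(grid * post) * dx)    # small but nonzero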