There are four points to address in order to make the l1 regularization and the Bayesian MAP estimate equivalent.
1. Slightly different definitions of $\lambda$
The optimization function used in the article is
$$S(\alpha,\beta;x,y) = \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$
The optimization function used by Python's sklearn differs by a factor $\frac{1}{2n}$ on the squared-error term.
$$S(\alpha,\beta;x,y) = \frac{1}{2n}\sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$
So your code will work when you divide your reg_alpha by $2n$.
See the manual
The optimization objective for Lasso is:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
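As a rough check of this scaling (a sketch of my own, not from the article; the data and the value of sk_alpha below are made up), minimizing the unscaled objective with $\lambda = 2n \cdot \texttt{alpha}$ should reproduce sklearn's Lasso coefficients:
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso
rng = np.random.default_rng(0)
n = 50
X = rng.random((n, 2))
y = X @ np.array([1.5, -2.0]) + 0.3 + 0.1 * rng.standard_normal(n)
sk_alpha = 0.05                  # sklearn's regularization strength (arbitrary)
lam = 2 * n * sk_alpha           # lambda in the unscaled objective
def unscaled_objective(params):
    a, b1, b2 = params
    resid = y - (a + b1 * X[:, 0] + b2 * X[:, 1])
    return np.sum(resid**2) + lam * (abs(b1) + abs(b2))
res = minimize(unscaled_objective, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 10**5})
lr = Lasso(alpha=sk_alpha, tol=1e-10, max_iter=10**6).fit(X, y)
print(res.x[1:])   # betas from the unscaled objective
print(lr.coef_)    # betas from sklearn; the two should agree closely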
2. Using the right $\sigma$
In addition, as Tim mentioned, you need to use the $\sigma$ that is the optimum in the Bayesian model.
With $f_\sigma(\sigma)$, $f_\alpha(\alpha)$, and $f_\beta(\beta) = \frac{1}{2b} e^{-\frac{|\beta|}{b}}$ as your prior densities, and $h(x,y)$ a normalization function, the logarithm of the posterior density is the following:
$$\begin{array}{rcl}
S(\alpha,\beta,\sigma;x,y) &=& \overbrace{n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2}^{\text{likelihood}} + \overbrace{\log f_\beta(\beta_1) + \log f_\beta(\beta_2)}^{\text{prior of beta}} + \overbrace{\log f_\sigma(\sigma) + \log f_\alpha(\alpha)}^{\text{prior other parameters}} + \overbrace{\log h(x,y)}^{\text{normalization function}}\\
&=& -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \log \frac{1}{2b} e^{-\frac{|\beta_1|}{b}} + \log \frac{1}{2b} e^{-\frac{|\beta_2|}{b}} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y) \\
&=& -\frac{n}{2} \log 2\pi - n \log \sigma \underbrace{- \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 - \frac{1}{b} \sum_{p=1}^2 |\beta_p|}_{\text{part that depends on the } \beta_p} + 2 \log \frac{1}{2b} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y)
\end{array}$$
The part that depends on the $\beta$'s is to be compared with the cost function. Flipping its sign (maximizing the log posterior is minimizing its negative) gives
$$\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \frac{1}{b} \sum_{p=1}^2 |\beta_p| $$
or multiplied by $\frac{\sigma^2}{n}$
$$\frac{1}{2n} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \frac{\sigma^2}{nb} \sum_{p=1}^2 |\beta_p| $$
this is the same as the sklearn (l1 regularization) cost function above if $$\lambda = \frac{\sigma^2}{nb}$$
Here $\sigma$ is the $\sigma$ that is part of the maximum a posteriori probability estimate.
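For illustration only (these numbers are hypothetical, not the output of the code below): if the MAP estimate were $\sigma \approx 0.6$ with $n = 150$ and $b = 1$, this would give $\lambda = 0.6^2/(150 \cdot 1) \approx 0.0024$, which is the value you would pass as sklearn's alpha.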
3. Prior on $\alpha$
When you have a prior on $\alpha$ you get a different estimate for this parameter. This will also influence the estimates of the $\beta$, since the $X$ variables correlate with the intercept and the $\beta$'s will partly correct for the different $\alpha$.
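For instance, with a flat prior on $\alpha$ (as in the code example below) the MAP intercept given the $\beta$'s is the ordinary unpenalized one,
$$\hat\alpha = \bar y - \beta_1 \bar x_1 - \beta_2 \bar x_2,$$
which matches how sklearn fits the intercept; a proper prior (say a normal centered at 0) would shrink $\hat\alpha$, and the $\beta$ estimates would shift to compensate.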
4. Accuracy of the computations
Another source of variation between the two methods is that the algorithms might not converge to full accuracy; due to early stopping you can get slightly different answers.
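A quick way to see whether the Lasso solver really converged (a small sketch; X, y and the value of alpha here stand in for the ones in the code example below) is to tighten the tolerance and check the number of iterations it actually used:
from sklearn.linear_model import Lasso
lr = Lasso(alpha=0.01, tol=1e-6, max_iter=10**5).fit(X, y)
print(lr.n_iter_)   # far below max_iter suggests convergence; sklearn also
                    # emits a ConvergenceWarning when it runs out of iterations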
Code example
The code below combines the four points above
import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso

np.random.seed(1)

# generate data
k = 2
n = 150
true_sigma = 1
true_alpha = 5
coefs = np.random.rand(k) * 3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

# Bayesian model with a Laplace prior on the coefficients
basic_model = pm.Model()
b = 1
with basic_model:
    alpha = pm.Flat("alpha")                       # different prior (point 3)
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")                   # different prior (point 3)
    mu = alpha + (beta * X).sum(axis=1)
    Y_obs = pm.Normal("Y_obs", mu=mu, sd=sigma, observed=y)
    step = pm.NUTS()
    pm.sample(2000, tune=1000, step=step)

map_estimate = pm.find_MAP(model=basic_model, maxeval=10**6)
# print(map_estimate['sigma'])

# now using Lasso
reg_alpha = map_estimate['sigma']**2 / b / n       # different alpha (points 1 and 2)
lr = Lasso(alpha=reg_alpha, tol=1e-6, max_iter=10**5)  # tighter tolerance to have the algorithm converge further (point 4)
lr.fit(X, y)

print("\n")
print(lr.coef_)
print(map_estimate['beta'])
output
[1.15586011 2.45047217]
[1.15586059 2.45047129]
Why is l1 regularization so much faster than find_MAP?
Your question seems to be motivated by the use of Lasso as a faster alternative. Given a Bayesian regression with a Laplace prior with scale $b$, it is indeed possible to use Lasso and get similar results, at least 100x faster.
But the PyMC3 library is overkill for this problem, and that is why it is comparatively slow: the posterior is built as a generic probabilistic model, and the sampling step in the code above is especially expensive.
- This is unnecessary for this problem, where an expression for the posterior is already known.
- PyMC3 and sampling with a Monte Carlo method are useful when the expression for the posterior is too complex to be computed by hand or with a computational method or computer algebra program.
For instance, the posterior could involve a step like integration to marginalize out nuisance parameters. That is actually already the case here if we look for the posterior of the $\beta$'s only, e.g. when we wish to maximize the marginal distribution of $\beta$
$$p(\beta \mid x,y) = \iint e^{S(\alpha,\beta,\sigma;x,y)} \, d\alpha \, d\sigma$$
But to be honest, I am not sure whether find_MAP is optimizing to find the maximum of the joint distribution or to find the individual maxima of the marginal distributions.
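To make the first bullet point above concrete, here is a minimal sketch (my own, not part of the original answer) that maximizes the known log posterior directly with a generic optimizer, assuming flat priors for $\alpha$ and $\sigma$ and a Laplace$(0,b)$ prior for the $\beta$'s. Up to details of how PyMC3 handles transformed variables, the result should be close to find_MAP and to Lasso, and it runs much faster:
import numpy as np
from scipy.optimize import minimize
def neg_log_posterior(params, X, y, b=1.0):
    # params = (alpha, beta_1, ..., beta_k, log sigma); flat priors on alpha and sigma
    alpha, beta, log_sigma = params[0], params[1:-1], params[-1]
    sigma = np.exp(log_sigma)
    resid = y - alpha - X @ beta
    neg_loglik = len(y) * np.log(sigma) + np.sum(resid**2) / (2 * sigma**2)
    return neg_loglik + np.sum(np.abs(beta)) / b   # Laplace prior on the betas
# usage, with X, y, k from the code example above:
# start = np.zeros(k + 2)
# res = minimize(neg_log_posterior, start, args=(X, y), method="Nelder-Mead",
#                options={"maxiter": 10**5, "xatol": 1e-8, "fatol": 1e-8})
# print(res.x[1:-1])   # the beta estimates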
Best Answer
As all the comments above note, the Bayesian interpretation of LASSO is not taking the expected value of the posterior distribution, which is what you would want to do if you were a purist. If that were the case, then you would be right that there is only a very small chance that the estimate would be exactly zero given the data.
In reality, the Bayesian interpretation of LASSO takes the MAP (Maximum A Posteriori) estimator of the posterior. It sounds like you are familiar with this, but for anyone who is not: it is basically Bayesian maximum likelihood, where you use the value that corresponds to the maximum probability of occurrence (the mode) as your estimator for the parameters in LASSO. Since the prior density increases exponentially toward zero from the negative direction and falls off exponentially in the positive direction, unless your data strongly suggest the beta is some other significant value, the maximum of your posterior is likely to be exactly 0.
Long story short, your intuition seems to be based on the mean of the posterior, but the Bayesian interpretation of LASSO is based on taking the mode of the posterior.
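A small numerical sketch of this point (the numbers here are made up): with a single coefficient, a normal likelihood, and a Laplace prior, the posterior mode can sit exactly at zero while the posterior mean does not:
import numpy as np
beta_hat, se, b = 0.3, 0.5, 0.5          # hypothetical estimate, standard error, prior scale
grid = np.linspace(-5, 5, 200001)
dx = grid[1] - grid[0]
log_post = -(grid - beta_hat)**2 / (2 * se**2) - np.abs(grid) / b
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx                  # normalize the density on the grid
print("posterior mode:", grid[np.argmax(post)])       # (numerically) zero
print("posterior mean:", np.sum(grid * post) * dx)    # small but nonzero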