Solved – Bayesian lasso vs spike and slab

bayesian, feature-selection

Question: What are the advantages/disadvantages of using one prior over the other for variable selection?

Suppose I have the likelihood:
$$y\sim\mathcal{N}(Xw,\sigma^2I)$$
where I can put either one of the priors:
$$
w_i\sim \pi\delta_0+(1-\pi)\mathcal{N}(0,100)\\
\pi=0.9\,,
$$

or:
$$
p(w_i\mid\lambda)\propto \exp(-\lambda|w_i|)\\
\lambda \sim \Gamma(1,1)\,.
$$

I set $\pi=0.9$ to encode the belief that most of the weights are zero, and a gamma hyperprior on $\lambda$ so that the 'regularizing' parameter is learned from the data. The second prior is the Laplace (double-exponential) density, which is what makes this model the Bayesian lasso.
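For concreteness, here is a minimal sketch of both models in PyMC (my own choice of tooling, not something from the question), with $\sigma$ fixed at 1 for brevity; the spike is encoded through explicit Bernoulli inclusion indicators, and all problem sizes are illustrative:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [2.0, -3.0]
y = X @ w_true + rng.normal(size=n)

# Spike-and-slab: z_i = 1 with probability 1 - pi = 0.1 selects the slab.
with pm.Model() as spike_slab:
    z = pm.Bernoulli("z", p=0.1, shape=d)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=d)  # slab N(0, 100)
    w = pm.Deterministic("w", z * beta)
    pm.Normal("y", mu=pm.math.dot(X, w), sigma=1.0, observed=y)
    idata_ss = pm.sample()

# Bayesian lasso: Laplace prior with scale 1/lambda, lambda ~ Gamma(1, 1).
with pm.Model() as bayes_lasso:
    lam = pm.Gamma("lam", alpha=1.0, beta=1.0)
    w = pm.Laplace("w", mu=0.0, b=1.0 / lam, shape=d)
    pm.Normal("y", mu=pm.math.dot(X, w), sigma=1.0, observed=y)
    idata_bl = pm.sample()
```

PyMC assigns NUTS to the continuous parameters and a binary Gibbs step to the indicators, which is one reason the spike-and-slab version is typically slower to sample.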

However, my professor keeps insisting that the lasso version 'shrinks' the coefficients and is not actually doing proper variable selection, i.e., it over-shrinks even the relevant parameters.

I personally find the lasso version easier to implement, since I use variational Bayes. In fact, the Sparse Bayesian Learning paper (Tipping, 2001), whose marginal prior is effectively proportional to $\frac{1}{|w_i|}$, gives even sparser solutions.

Reflection

Since I've left academia, I've had a chance to get some more practical experience in this area. While it is true that spike-and-slab methods place genuine prior mass on exact zeros, and hence yield a genuine posterior probability that a coefficient is zero, the lasso-based methods are (extremely) fast, and in practice it is enough to look at the mean of the weight distribution. When you are dealing with potentially millions of parameters, this matters.
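To make the speed point concrete: with $\lambda$ held fixed, the MAP estimate under the Laplace prior is exactly the lasso solution, so a sparse point estimate is a single coordinate-descent solve away. Here is a sketch using scikit-learn; the mapping $\alpha = \lambda\sigma^2/n$ is my own bookkeeping to match sklearn's objective, and the problem sizes are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 200, 1000                      # wide problem, d >> n
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = rng.normal(scale=3.0, size=5)
y = X @ w_true + rng.normal(size=n)

lam, sigma2 = 10.0, 1.0               # fixed lambda, known noise variance
# sklearn minimizes (1/(2n))||y - Xw||^2 + alpha*||w||_1,
# so the Laplace-prior MAP corresponds to alpha = lam * sigma2 / n.
fit = Lasso(alpha=lam * sigma2 / n).fit(X, y)
print("non-zero coefficients:", int(np.sum(fit.coef_ != 0)))
```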

I've also gone on to recognise that my professor is an idiot who can't get past the 90s.

Best Answer

Both of these methods (LASSO vs. spike-and-slab) can be interpreted as Bayesian estimation problems where you are specifying different priors. One of the main differences is that the LASSO method does not put any point-mass on zero in the prior (i.e., the parameters are almost surely non-zero a priori), whereas the spike-and-slab puts a substantial point-mass on zero.

In my humble opinion, the main advantage of the spike-and-slab method is that it is well-suited to problems where the number of parameters exceeds the number of data points, and you want to eliminate a substantial number of parameters from the model entirely. Because this method puts a large point-mass on zero in the prior, it yields posterior estimates that tend to involve only a small proportion of the parameters, which helps avoid over-fitting the data.
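To make the "exact zeros" point concrete, here is a self-contained Gibbs sampler for the spike-and-slab model in the question (the classic stochastic-search variable selection update, assuming known noise variance), on synthetic data with more parameters than observations; all constants are illustrative:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(1)
n, d = 40, 100                        # more parameters than data points
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + rng.normal(size=n)

q, tau2, sigma2 = 0.1, 100.0, 1.0     # P(include) = 1 - pi, slab var, noise var
w = np.zeros(d)
incl_counts = np.zeros(d)
n_iter, burn = 2000, 500

for it in range(n_iter):
    for i in range(d):
        r = y - X @ w + X[:, i] * w[i]                  # residual without w_i
        a = X[:, i] @ X[:, i] / sigma2 + 1.0 / tau2
        b = X[:, i] @ r / sigma2
        log_bf = 0.5 * (b * b / a - np.log(tau2 * a))   # slab vs. spike
        p_incl = expit(log_bf + np.log(q / (1.0 - q)))
        if rng.random() < p_incl:
            w[i] = rng.normal(b / a, np.sqrt(1.0 / a))  # draw from the slab
        else:
            w[i] = 0.0                                  # exact zero: the spike
    if it >= burn:
        incl_counts += (w != 0)

print("posterior inclusion probabilities (first 5 coordinates):")
print(np.round(incl_counts[:5] / (n_iter - burn), 2))
```

Every sweep, each coefficient is either drawn from the slab or set exactly to zero, so the fraction of post-burn-in sweeps with $w_i \neq 0$ is a direct Monte Carlo estimate of its posterior inclusion probability.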

When your professor tells you that the former is not performing variable selection, what he probably means is this. Under the LASSO prior, each of the parameters is almost surely non-zero a priori (i.e., they are all in the model). Since the likelihood is also non-zero over the parameter support, each of the parameters is likewise almost surely non-zero a posteriori (i.e., they all remain in the model). Now, you might supplement this with a hypothesis test, and rule parameters out of the model that way, but that would be an additional test imposed on top of the Bayesian model.
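Such a supplementary rule is easy to bolt on: a common (if ad hoc) choice is to drop every coefficient whose 95% posterior credible interval covers zero. A sketch with stand-in posterior draws (in practice these would come from your sampler or variational fit):

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in posterior draws for three coefficients, shape (4000, 3);
# in practice these come from your MCMC or VB posterior.
draws = np.column_stack([
    rng.normal(0.05, 0.30, size=4000),   # looks like noise
    rng.normal(1.50, 0.20, size=4000),   # clearly non-zero
    rng.normal(-0.02, 0.25, size=4000),  # looks like noise
])
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
selected = (lo > 0) | (hi < 0)           # 95% interval excludes zero
print("selected:", selected)             # expect [False  True False]
```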

The results of Bayesian estimation will reflect a contribution from the data and a contribution from the prior. Naturally, a prior distribution that is more closely concentrated around zero (like the spike-and-slab) will indeed "shrink" the resultant parameter estimates, relative to a prior that is less concentrated (like the LASSO). Of course, this "shrinking" is merely the effect of the prior information you have specified. The shape of the LASSO prior means that it shrinks all parameter estimates towards zero, relative to a flatter prior.
