Solved – Bayesian spike and slab versus penalized methods

Tags: bayesian, bsts, feature-selection, r, regularization

I'm reading Steven Scott's slides about the BSTS R package (you can find them here: slides).

At some point, when talking about including many regressors in the structural time series model, he introduces spike and slab priors on the regression coefficients and says that they are preferable to penalized methods.

Scott says, referring to an example dataset with 100 predictors:

  • Penalized methods make a single decision about which variables are included/excluded; that is, they choose one subset of predictors, i.e. one model among the $2^{100}$ possible ones.
  • "Lasso (and related) priors are not sparse, they induce sparsity at the mode but not in the posterior distribution"

At this point he introduces the Spike and Slab priors.

I think I got the intuition, but I want to be sure about it:

  • Are they better in the sense that they basically use a brute force approach testing each possible subset of regressors to include?
  • Is the drawback the computation time in doing so?
  • What do you think he means when saying "Lasso (and related)…but not in the posterior distribution"?

Best Answer

I'll answer your third question first and address your other two later.

  3. What do you think he means when saying "Lasso (and related)...but not in the posterior distribution"?

This figure from his slides shows what he means. Expressing the lasso regularizer as a prior distribution means the prior takes the form of a Laplace (double-exponential) distribution. This distribution has a characteristic non-smooth peak at its mean, which is set to 0 to achieve a sparsifying regularization effect. To recover the lasso-regularized result directly, you take the mode of the posterior distribution, i.e. the MAP estimate.

[Figure from Scott's slides: a Laplace (lasso) prior and the resulting posterior under a weak likelihood (left) and a strong likelihood (right).]

In the figure, the blue dashed line is the Laplace prior. The posterior distribution, in solid black, has its mode at 0 on the left, where the likelihood is weak, and a non-zero mode on the right, where the likelihood is strong.

However, the full posterior distribution is not sparse: samples drawn from it may cluster near 0, but because it is a continuous distribution they will never be exactly 0.
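
To make this concrete, here is a minimal base-R sketch (the penalty value and the data settings are made up purely for illustration): the negative log of the Laplace prior is exactly the lasso penalty $\lambda|\beta|$, and a grid approximation of the resulting posterior has its mode at 0 only when the likelihood is weak, while assigning zero probability to $\beta$ being exactly 0 in either case.

```r
# Minimal sketch: posterior for one coefficient beta under a Laplace prior.
# lambda, beta_hat and se are illustrative values, not taken from the slides.

lambda <- 2                                   # Laplace rate; -log prior = lambda*|beta| + const
log_prior <- function(b) -lambda * abs(b)

# Gaussian likelihood for a single observed effect estimate beta_hat with std. error se
log_lik <- function(b, beta_hat, se) dnorm(beta_hat, mean = b, sd = se, log = TRUE)

posterior_on_grid <- function(beta_hat, se) {
  grid <- seq(-3, 3, length.out = 4001)
  logpost <- log_lik(grid, beta_hat, se) + log_prior(grid)
  dens <- exp(logpost - max(logpost))
  dens <- dens / sum(dens * diff(grid)[1])    # normalize to a proper density
  list(grid = grid, dens = dens, mode = grid[which.max(dens)])
}

weak   <- posterior_on_grid(beta_hat = 0.5, se = 1.0)   # weak likelihood: mode sits at 0
strong <- posterior_on_grid(beta_hat = 1.5, se = 0.2)   # strong likelihood: mode moves off 0
c(weak_mode = weak$mode, strong_mode = strong$mode)

# The posterior is a continuous density, so P(beta == 0) is zero in both cases:
# draws from it are never exactly 0, even when the mode is at 0.
```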

In order to achieve sparsity with a lasso approach, you typically need to set some cutoff threshold on the posterior mode. The ideal case is a posterior mode exactly equal to 0, but you could relax this and eliminate a variable whenever the absolute value of its posterior mode falls below some threshold, say 0.2.

Performing this sparsification under the lasso yields one particular set of eliminated and retained regressors, and that is the "single decision" about which regressors are included or excluded.
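
As a sketch of that single decision (assuming the glmnet package and simulated data, since the slides' 100-predictor example isn't available here), the lasso fit at a chosen penalty returns exactly one partition of the coefficients into zero and non-zero:

```r
# One penalized fit -> one fixed subset of included regressors (glmnet assumed).
library(glmnet)

set.seed(42)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, 5), rep(0, p - 5))       # only the first 5 predictors matter
y <- drop(X %*% beta_true + rnorm(n))

cvfit <- cv.glmnet(X, y, alpha = 1)            # lasso with a cross-validated penalty
coefs <- coef(cvfit, s = "lambda.min")[-1]     # MAP-style point estimate, intercept dropped

included <- which(coefs != 0)                  # a single in/out decision, no uncertainty attached
length(included)
```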

A fully Bayesian approach to variable selection, the spike and slab prior, retains uncertainty about which variables should be included or excluded all the way through the model.
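
In its most common form, the spike and slab prior on each coefficient is a two-component mixture, $\beta_j \sim (1-\pi)\,\delta_0 + \pi\,N(0, \tau^2)$: the point mass $\delta_0$ at zero is the "spike" and the wide normal component is the "slab". The posterior probability that $\beta_j$ falls in the slab is its inclusion probability, and that probability is carried through the analysis rather than being collapsed to a yes/no decision.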

So to address your first question:

  1. Are they better in the sense that they basically use a brute force approach testing each possible subset of regressors to include?

This is a misunderstanding: neither method tests all possible subsets of regressors. The spike and slab posterior is explored stochastically, typically with an MCMC sampler over the inclusion indicators, so promising subsets are visited far more often than poor ones without ever enumerating the $2^{100}$ candidate models.

  2. Is the drawback the computation time in doing so?

This is also a misunderstanding, since the computation time is not spent on brute-force testing of every subset. The main cost is running the MCMC sampler long enough to mix well, and that cost grows with the number of predictors and the number of draws, not with $2^{100}$.

To clarify Scott's point: given some data, if you use a penalized likelihood sparsification approach you will get exactly one set of included and excluded regressors. But if you use a spike and slab approach, you have a full posterior distribution for each regressor, each with its own posterior probability of being included or excluded. Some regressors might have a 70% chance of being included, others a 25% chance. This can be preferable in many applications, because given a single dataset we should still have some uncertainty over which regressors are important.

Intuitively, a spike and slab prior better represents the space of possible included/excluded regressors than a penalized likelihood approach like the lasso.
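
As a contrasting sketch (assuming the BoomSpikeSlab package, which provides the spike and slab regression machinery that bsts builds on, and the same kind of simulated data as above), a spike and slab fit reports a posterior inclusion probability for every regressor instead of a single in/out decision:

```r
# Spike and slab regression: posterior inclusion probabilities (BoomSpikeSlab assumed).
library(BoomSpikeSlab)

set.seed(42)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- drop(X[, 1:5] %*% rep(2, 5) + rnorm(n))   # only x1..x5 truly matter
df <- data.frame(y = y, X)

fit <- lm.spike(y ~ ., data = df, niter = 2000)

# Posterior inclusion probability of each coefficient: the share of MCMC draws
# in which that coefficient is non-zero (in the slab rather than the spike).
inc_prob <- colMeans(fit$beta != 0)
head(sort(inc_prob, decreasing = TRUE), 10)
```

If you ever need a hard subset, thresholding these probabilities is then a separate, explicit modelling decision rather than something baked into the estimator.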
