The product of the two densities in
$$
p(\boldsymbol{\Lambda_0} \mid \boldsymbol{X}, \boldsymbol{\Lambda}, \upsilon, D, \boldsymbol{\Lambda_x}) \propto
\mathcal{W}(\boldsymbol{\Lambda} \mid \upsilon, \boldsymbol{\Lambda_0})\,
\mathcal{W}(\boldsymbol{\Lambda_0} \mid D, \tfrac{1}{D}\boldsymbol{\Lambda_x})
$$
leads to
\begin{align*}
p(\boldsymbol{\Lambda_0} \mid \boldsymbol{X}, \boldsymbol{\Lambda}, \upsilon, D, \boldsymbol{\Lambda_x}) &\propto
|\boldsymbol{\Lambda_0}|^{-\upsilon/2}\,\exp\{-\text{tr}(\boldsymbol{\Lambda_0}^{-1}\boldsymbol{\Lambda})/2\}\\
&\quad\times |\boldsymbol{\Lambda_0}|^{(D-p-1)/2}\,\exp\{-D\,\text{tr}(\boldsymbol{\Lambda_x}^{-1}\boldsymbol{\Lambda_0})/2\}\\
&\propto |\boldsymbol{\Lambda_0}|^{(D-\upsilon-p-1)/2}\,\exp\{-\text{tr}(\boldsymbol{\Lambda_0}^{-1}\boldsymbol{\Lambda}+D\,\boldsymbol{\Lambda_x}^{-1}\boldsymbol{\Lambda_0})/2\}\,,
\end{align*}
which does not appear to be a standard density. To keep conjugacy of sorts, the right hierarchical prior on $\boldsymbol{\Lambda_0}$ should be something like
$$
\boldsymbol{\Lambda_0} \sim \mathcal{IW}(D, \tfrac{1}{D}\boldsymbol{\Lambda_x})\,.
$$
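Indeed, a quick sketch shows why this restores conjugacy (assuming the standard parameterization $\mathcal{IW}(\boldsymbol{\Lambda_0} \mid D, \boldsymbol{\Psi}) \propto |\boldsymbol{\Lambda_0}|^{-(D+p+1)/2}\,\exp\{-\text{tr}(\boldsymbol{\Psi}\boldsymbol{\Lambda_0}^{-1})/2\}$):
\begin{align*}
p(\boldsymbol{\Lambda_0} \mid \boldsymbol{X}, \boldsymbol{\Lambda}, \upsilon, D, \boldsymbol{\Lambda_x}) &\propto
|\boldsymbol{\Lambda_0}|^{-\upsilon/2}\,\exp\{-\text{tr}(\boldsymbol{\Lambda_0}^{-1}\boldsymbol{\Lambda})/2\}\\
&\quad\times |\boldsymbol{\Lambda_0}|^{-(D+p+1)/2}\,\exp\{-\text{tr}(\tfrac{1}{D}\boldsymbol{\Lambda_x}\boldsymbol{\Lambda_0}^{-1})/2\}\\
&\propto |\boldsymbol{\Lambda_0}|^{-(\upsilon+D+p+1)/2}\,\exp\{-\text{tr}\big((\boldsymbol{\Lambda}+\tfrac{1}{D}\boldsymbol{\Lambda_x})\boldsymbol{\Lambda_0}^{-1}\big)/2\}\,,
\end{align*}
which is the kernel of $\mathcal{IW}(\upsilon+D,\,\boldsymbol{\Lambda}+\tfrac{1}{D}\boldsymbol{\Lambda_x})$, so the posterior stays in the inverse-Wishart family.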
I'll answer your third question first and address your other two later.
- What do you think he means when saying "Lasso (and related)...but not in the posterior distribution"?
This figure from his slides shows what he means. Expressing the lasso regularizer as a prior distribution means your prior takes the form of a Laplace (double-exponential) distribution. This distribution has a characteristic non-differentiable peak at its location, which is set to 0 to achieve a sparsifying effect. To recover a lasso-regularized result exactly, you should take the mode of your posterior distribution, i.e. the maximum a posteriori (MAP) estimate.
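To spell that out (a standard identity; the notation below is mine, not from the slides): with a Gaussian likelihood and independent Laplace priors $\pi(\beta_j) \propto e^{-\lambda|\beta_j|}$, the posterior mode is
$$
\hat{\boldsymbol{\beta}}_{\text{MAP}}
= \arg\max_{\boldsymbol{\beta}} \big[\log p(\boldsymbol{y} \mid \boldsymbol{\beta}) + \log \pi(\boldsymbol{\beta})\big]
= \arg\min_{\boldsymbol{\beta}} \frac{1}{2\sigma^2}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1\,,
$$
which is exactly the lasso objective.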
In the figure, the blue dashed line represents the Laplace prior. On the left, where the likelihood is weak, the posterior (solid black) has its mode at 0; on the right, where the likelihood is strong, the mode is non-zero.
However, the full posterior distribution is not sparse: draws from it may land close to 0, but because it is a continuous distribution, no draw will ever be exactly 0.
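Here is a minimal numerical sketch of that point (the setup, numbers, and names below are my own illustrative assumptions, not from the slides): a single coefficient with a Laplace prior, where the MAP estimate can be exactly 0 while Metropolis draws from the same posterior never are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: y_i ~ N(beta, sigma^2) with a Laplace(0, b) prior
# on beta, so the MAP estimate is the (1D) lasso solution.
sigma, b = 1.0, 0.5                      # noise sd and Laplace scale
y = rng.normal(0.15, sigma, size=20)     # weak signal: true beta is small

def log_post(beta):
    # log p(beta | y) up to a constant: Gaussian log-lik + Laplace log-prior
    return -np.sum((y - beta) ** 2) / (2 * sigma**2) - abs(beta) / b

# MAP = soft-thresholding of the sample mean (closed form in this 1D case).
n, ybar = len(y), y.mean()
beta_map = np.sign(ybar) * max(abs(ybar) - sigma**2 / (n * b), 0.0)

# Random-walk Metropolis draws from the full posterior.
samples, beta = [], 0.0
for _ in range(20000):
    prop = beta + rng.normal(0.0, 0.2)
    if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
        beta = prop
    samples.append(beta)
draws = np.array(samples[2000:])         # discard burn-in

print(f"MAP (lasso) estimate: {beta_map:.4f}")          # may be exactly 0
print(f"draws exactly 0: {(draws == 0.0).mean():.3f}")  # essentially always 0.000
print(f"smallest |draw|: {np.abs(draws).min():.2e}")    # tiny, but never 0
```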
In order to achieve sparsity with a lasso approach, you typically need to set some cutoff threshold on the posterior mode. The ideal case is a posterior mode exactly equal to 0, but you could relax this and eliminate a variable whose posterior mode is, say, less than 0.2 in absolute value.
Performing this sparsification under the lasso yields one particular set of eliminated and retained regressors: the "single decision" about which regressors are included or excluded.
A fully Bayesian approach to variable selection, the spike-and-slab prior, retains uncertainty about which variables should be included or excluded all the way through the model.
So to address your first question:
- Are they better in the sense that they basically use a brute force approach testing each possible subset of regressors to include?
This is a misunderstanding: neither method works by brute-force testing of every possible subset of regressors.
- Is the drawback the computation time in doing so?
This is also a misunderstanding, since the computation time is not spent brute-force testing each possible subset of regressors.
To clarify Scott's point: given some data, a penalized-likelihood sparsification approach yields exactly one set of included and excluded regressors. With a spike-and-slab approach, by contrast, you have a full posterior distribution for each regressor, each with its own posterior probability of being included or excluded: some regressors might have a 70% chance of being included, others a 25% chance. This can be preferable in many applications, because given a single dataset we should still have uncertainty over which regressors matter.
Intuitively, a spike-and-slab prior better represents the space of possible included/excluded regressor sets than a penalized-likelihood approach like the lasso.
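As a concrete sketch of what that posterior over inclusion looks like (everything below is illustrative: a tiny simulated problem, a conjugate Gaussian slab with known noise variance, and exact enumeration over the $2^p$ models, feasible only because $p = 3$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tiny regression: 3 candidate regressors, only the first matters.
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(0.0, 1.0, size=n)

sigma2, tau2, prior_inc = 1.0, 1.0, 0.5  # noise var, slab var, prior P(include)

def log_marginal(idx):
    """log p(y | model): slab coefficients N(0, tau2) integrated out."""
    cov = sigma2 * np.eye(n)
    if idx:
        Xs = X[:, idx]
        cov += tau2 * Xs @ Xs.T
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    return -0.5 * (logdet + y @ np.linalg.solve(cov, y))

# Enumerate all 2^p models; weight by marginal likelihood times model prior.
models = [m for k in range(p + 1) for m in itertools.combinations(range(p), k)]
logw = np.array([log_marginal(list(m))
                 + len(m) * np.log(prior_inc)
                 + (p - len(m)) * np.log(1 - prior_inc) for m in models])
w = np.exp(logw - logw.max())
w /= w.sum()

# Posterior inclusion probability per regressor: a distribution over
# in/out decisions rather than a single selected subset.
for j in range(p):
    pip = sum(wi for wi, m in zip(w, models) if j in m)
    print(f"P(regressor {j} included | y) = {pip:.2f}")
```

In realistic problems this enumeration is replaced by MCMC over models, but the output is the same in spirit: a probability of inclusion for each regressor rather than a single yes/no decision.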
Best Answer
Both of these methods (LASSO and spike-and-slab) can be interpreted as Bayesian estimation problems in which you specify different priors. One of the main differences is that the LASSO prior does not put any point mass on zero (i.e., the parameters are almost surely non-zero a priori), whereas the spike-and-slab prior puts a substantial point mass on zero.
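To make that contrast concrete (standard forms; the symbols $\lambda$, $\pi_0$, and $\tau^2$ are my notation, not from this answer): per coefficient,
$$
\pi_{\text{lasso}}(\beta_j) = \frac{\lambda}{2}\,e^{-\lambda|\beta_j|}
\qquad\text{vs.}\qquad
\pi_{\text{ss}}(\beta_j) = \pi_0\,\delta_0(\beta_j) + (1-\pi_0)\,\mathcal{N}(\beta_j \mid 0, \tau^2)\,,
$$
so the Laplace prior gives $P(\beta_j = 0) = 0$, while the spike-and-slab prior assigns probability $\pi_0 > 0$ to the event $\beta_j = 0$ exactly.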
In my humble opinion, the main advantage of the spike-and-slab method is that it is well-suited to problems where the number of parameters is more than the number of data points, and you want to completely eliminate a substantial number of parameters from the model. Because this method puts a large point-mass on zero in the prior, it will yield posterior estimates that tend to involve only a small proportion of the parameters, hopefully avoiding over-fitting of the data.
When your professor tells you that the former is not performing variable selection, what he probably means is this. Under LASSO, each of the parameters is almost surely non-zero a priori (i.e., they are all in the model). Since the likelihood is also non-zero over the parameter support, each is almost surely non-zero a posteriori as well (i.e., they all remain in the model). You might supplement this with a hypothesis test and rule parameters out of the model that way, but that would be an additional test imposed on top of the Bayesian model.
The results of Bayesian estimation reflect a contribution from the data and a contribution from the prior. Naturally, a prior that is more closely concentrated around zero (like the spike-and-slab) will indeed "shrink" the resulting parameter estimates more, relative to a prior that is less concentrated (like the LASSO). Of course, this "shrinking" is merely the effect of the prior information you have specified. The shape of the LASSO prior means that it shrinks all parameter estimates towards zero, relative to a flatter prior.