First, let's look at Mitchell and Beauchamp (1988)[1] for a description of what a spike-and-slab prior is:
That is, $\beta_j$ is uniformly distributed between the two limits $-f_j$ and $f_j$, except for a bit of probability mass concentrated at 0 if $x_j$ is vulnerable to deletion. We are interested in taking $f_j$ to be very large for all $j$, ...
Now if $f_j$ is large-but-finite, this is a proper prior - we can even write down the cdf explicitly.
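For concreteness, here is a sketch of that cdf, writing $h_{0j}$ for the prior probability of the spike at 0 (the symbol $h_{0j}$ is my notation here, not necessarily the paper's):

$$F_j(b) = \begin{cases} 0, & b < -f_j \\ (1 - h_{0j})\,\frac{b + f_j}{2 f_j}, & -f_j \le b < 0 \\ h_{0j} + (1 - h_{0j})\,\frac{b + f_j}{2 f_j}, & 0 \le b \le f_j \\ 1, & b > f_j \end{cases}$$

The jump of size $h_{0j}$ at $b = 0$ is the spike; everywhere else the cdf rises linearly, which is the uniform slab.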
(You'll sometimes see people actually draw something like this, with the spike and the slab on the same axes. That might help with picturing it in some sense, but the problem is: what does the y-axis then represent? It can't be density, because the spike represents probability; and it can't be probability, because the uniform part represents density. The two parts are on completely different scales, and drawing them together seems to encourage the mistake of conflating probability with density.)
Mitchell and Beauchamp keep $f_j$ finite but assume that it's large enough that the relevant integrals from $-f_j$ to $f_j$ can be well approximated by integrals from $-\infty$ to $\infty$.
If, however, we take the limit as $f_j\to\infty$, then of course it is no longer a proper prior. When the prior is used for variable selection, this limit is generally avoided because of the way it impacts the selection: with an infinitely wide slab, the marginal likelihood of any model that includes $\beta_j$ is driven to zero, so the spike at 0 always wins (try it for a simple case).
Other priors have been given the name "spike and slab" since -- including the case with a Gaussian slab, as you mention. In that case, the prior is proper as long as the variance of the normal is finite.
[1]: Mitchell T.J. and Beauchamp, J.J. (1988),
"Bayesian Variable Selection in Linear Regression"
Journal of the American Statistical Association, Vol. 83, No. 404 (Dec.), pp. 1023-1032.
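As a quick illustration of the Gaussian-slab variant mentioned above, here is a minimal sampling sketch (the function name and all parameter values are illustrative, not from any paper or library):

```python
import numpy as np

def sample_spike_slab(n, p_spike=0.5, slab_sd=10.0, seed=None):
    """Draw n samples from a spike-and-slab prior with a Gaussian slab.

    With probability p_spike a draw is exactly 0 (the spike);
    otherwise it is drawn from N(0, slab_sd**2) (the slab).
    """
    rng = np.random.default_rng(seed)
    at_spike = rng.random(n) < p_spike           # which draws land on the spike
    slab = rng.normal(0.0, slab_sd, size=n)      # slab draws (almost surely non-zero)
    return np.where(at_spike, 0.0, slab)

draws = sample_spike_slab(100_000, p_spike=0.3, slab_sd=5.0, seed=0)
print((draws == 0.0).mean())  # close to 0.3: the point mass at zero survives sampling
```

The prior here is proper because the slab variance is finite, echoing the point above.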
Both of these methods (LASSO vs. spike-and-slab) can be interpreted as Bayesian estimation problems where you are specifying different parameters. One of the main differences is that the LASSO method does not put any point-mass on zero for the prior (i.e., the parameters are almost surely non-zero a priori), whereas the spike-and-slab puts a substantial point-mass on zero.
In my humble opinion, the main advantage of the spike-and-slab method is that it is well-suited to problems where the number of parameters is more than the number of data points, and you want to completely eliminate a substantial number of parameters from the model. Because this method puts a large point-mass on zero in the prior, it will yield posterior estimates that tend to involve only a small proportion of the parameters, hopefully avoiding over-fitting of the data.
When your professor tells you that the former is not performing a variable selection method, what he probably means is this. Under LASSO, each of the parameters is almost surely non-zero a priori (i.e., they are all in the model). Since the likelihood is also non-zero over the parameter support, each parameter is almost surely non-zero a posteriori as well (i.e., they all stay in the model). Now, you might supplement this with a hypothesis test, and rule parameters out of the model that way, but that would be an additional test imposed on top of the Bayesian model.
The results of Bayesian estimation will reflect a contribution from the data and a contribution from the prior. Naturally, a prior distribution that is more closely concentrated around zero (like the spike-and-slab) will indeed "shrink" the resultant parameter estimators, relative to a prior that is less concentrated (like the LASSO). Of course, this "shrinking" is merely the effect of the prior information you have specified. The shape of the LASSO prior means that it shrinks all parameter estimates towards the prior mean of zero, relative to a flatter prior.
Best Answer
I'll answer your third question first and address your other two later.
This figure from his slides shows what he means. Expressing the lasso regularizer as a prior distribution means your prior distribution will take the form of a Laplacian or double-exponential distribution. This distribution has a characteristic non-smooth peak at the mean, which is set to 0 to achieve a sparse regularization effect. To directly get a lasso regularized result, you should take the mode of your posterior distribution.
In the figure, the blue dashed line represents the Laplacian prior distribution. The posterior distribution, in solid black, has its mode at 0 on the left with a weak likelihood, while the mode is non-zero on the right with a strong likelihood.
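The weak/strong-likelihood behaviour in the figure can be checked in a one-dimensional toy model. Assuming a single observation $y \sim N(\beta, \sigma^2)$ and a Laplace$(0, b)$ prior on $\beta$, maximizing the log-posterior gives the familiar soft-thresholding rule (the function below is an illustrative sketch, not anyone's published code):

```python
import numpy as np

def laplace_map(y, sigma2=1.0, b=1.0):
    """MAP estimate of beta for y ~ N(beta, sigma2) with a Laplace(0, b) prior.

    Maximizing -(y - beta)**2 / (2 * sigma2) - abs(beta) / b over beta
    gives the soft-thresholding rule, i.e. the 1-D lasso solution.
    """
    threshold = sigma2 / b
    return np.sign(y) * np.maximum(np.abs(y) - threshold, 0.0)

print(laplace_map(0.4))  # weak likelihood: posterior mode exactly 0
print(laplace_map(3.0))  # strong likelihood: non-zero mode, shrunk from 3.0 to 2.0
```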
However, the full posterior distribution is not sparse: if you sample from it you may get values near 0, but because it is a continuous distribution you will never draw a value that is exactly 0.
In order to achieve sparsity with a lasso approach, you typically need to set some cutoff threshold on the posterior mode. The ideal case is a posterior mode exactly equal to 0, but you could relax this and eliminate a variable whose posterior mode is, say, less than 0.2 in absolute value.
Performing this sparsification under lasso gives a particular set of eliminated and retained regressors, which is the "single decision" about which regressors are included or excluded.
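That "single decision" can be written out directly. With some hypothetical posterior modes (the numbers below are made up for illustration) and a 0.2 cutoff, you get one fixed set of retained regressors:

```python
import numpy as np

# Hypothetical posterior modes for five regressors (made-up numbers).
names = np.array(["x1", "x2", "x3", "x4", "x5"])
modes = np.array([0.0, 0.15, -0.8, 0.05, 2.3])

# Keep a regressor only if its posterior mode exceeds the cutoff in absolute value.
cutoff = 0.2
kept = names[np.abs(modes) > cutoff]
print(kept)  # one fixed in/out decision, no residual uncertainty
```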
A fully Bayesian approach to variable selection, the spike and slab prior, retains uncertainty about which variables should be included or excluded all the way through the model.
So to address your first question:
This is a misunderstanding, since neither method tests all possible subsets of regressors to include.
This is also a misunderstanding, since the computation time is not dominated by brute force testing each possible subset of regressors.
To clarify Scott's point: given some data, a penalized-likelihood sparsification approach gives you exactly one set of included and excluded regressors. But a spike and slab sparsification approach gives you a full posterior distribution for each regressor, and with it a posterior probability that each regressor is included or excluded. Some regressors might have a 70% chance of being included, others a 25% chance. This can be preferable in many applications, because given a single dataset we should still have uncertainty over which regressors are important or not.
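In practice those inclusion probabilities come from the posterior draws of the spike-and-slab inclusion indicators. A sketch with simulated indicator draws (the probabilities and draw count below are made up to mirror the 70%/25% example):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated MCMC output: each row is one posterior draw of the inclusion
# indicators for three regressors (True = regressor included in that draw).
true_probs = np.array([0.70, 0.25, 0.95])
indicator_draws = rng.random((4000, 3)) < true_probs

# Posterior inclusion probability = fraction of draws that include each regressor.
pip = indicator_draws.mean(axis=0)
print(pip.round(2))  # roughly [0.70, 0.25, 0.95]
```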
Intuitively, a spike and slab prior better represents the possible space of included/excluded regressors compared to a penalized likelihood approach like lasso.