Solved – What are the pros and cons of empirical Bayesian methods

bayesian

Empirical Bayesian methods are a new concept to me. They pique my interest because they offer a different philosophical and methodological perspective on statistical analysis.

With my limited knowledge, I am not sure whether it is useful and powerful or inferior to alternative methods, and whether it is worth learning (and to what degree) compared to other useful topics in Bayesian analysis.

What do you think about it?
What are the pros and cons of empirical Bayesian methods?

Best Answer

So we are clear, the idea is that I have data $Y \sim f(Y \mid \theta)$ and have a prior $\theta \sim \pi(\theta \mid \eta)$. Then the joint is $$ J(Y, \theta \mid \eta) = f(Y\mid \theta)\pi(\theta\mid \eta) $$ and the marginal of $Y$ is $$ m(Y\mid\eta)=\int f(Y\mid\theta) \pi(\theta\mid\eta) \ d\theta. $$ The empirical Bayes approach, rather than specifying the value of $\eta$ or placing a prior on $\eta$, estimates $$ \hat \eta = \arg \max_\eta m(Y\mid \eta). $$ Then, we draw inferences about $\theta$ from the "posterior" $\pi(\theta \mid Y, \hat \eta)$.

This describes parametric empirical Bayes. Maybe someone else can describe the situation for nonparametric empirical Bayes; I haven't dealt with it personally. The primary alternative to EB is to place a prior on $\eta$.
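
To make the recipe concrete, here is a minimal sketch in a toy Normal-Normal model where everything is tractable: $Y_i \mid \theta_i \sim N(\theta_i, \sigma^2)$ with $\sigma^2$ known, $\theta_i \sim N(0, \tau^2)$, and $\eta = \tau^2$. The variable names (`tau2_hat`, `neg_log_marginal`, etc.) are mine, and I maximize the marginal likelihood numerically even though this particular model has a closed form, just to show the general recipe.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy Normal-Normal model: Y_i | theta_i ~ N(theta_i, sigma2) with sigma2
# known, theta_i ~ N(0, tau2); the hyperparameter eta is tau2.
sigma2 = 1.0
tau2_true = 4.0
theta = rng.normal(0.0, np.sqrt(tau2_true), size=200)
y = rng.normal(theta, np.sqrt(sigma2))

# Marginal likelihood m(Y | tau2): integrating theta out gives
# Y_i ~ N(0, sigma2 + tau2) independently.
def neg_log_marginal(log_tau2):
    tau2 = np.exp(log_tau2)  # optimize on the log scale so tau2 > 0
    return -norm.logpdf(y, loc=0.0, scale=np.sqrt(sigma2 + tau2)).sum()

tau2_hat = np.exp(minimize_scalar(neg_log_marginal).x)

# Plug-in "posterior" pi(theta_i | Y, tau2_hat): Normal by conjugacy.
shrink = tau2_hat / (tau2_hat + sigma2)
post_mean = shrink * y        # E[theta_i | y_i, tau2_hat]
post_var = shrink * sigma2    # Var[theta_i | y_i, tau2_hat]

print(f"tau2_hat = {tau2_hat:.3f} (truth {tau2_true}), shrinkage = {shrink:.3f}")
```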

Some Pros

  1. The procedure is, in principle, automatic. No work needs to be done in eliciting a prior on $\eta$. Contrast this with choosing $\eta$ according to your prior knowledge about $\theta$, or using a hyperprior $\eta \sim \lambda(\eta \mid \gamma)$ (which will require specifying a value of $\gamma$). Subjectivity is always creeping in with these alternative approaches.

    In practice, it can be very annoying to have to specify a prior, and it can create a lot of work for the scientist. Empirical Bayes can shift that workload to the computer.

  2. Related to 1., I've found that this can provide some stabilization of our results. Normally I would try to place a prior on $\eta$, but if my prior is too vague or out of line with the data, I find you can get some strange results. I'm more likely to get a sane answer with empirical Bayes. (Note: This is my personal experience with the models I've worked with; it is easy to imagine EB overfitting for the same reason maximum likelihood results in overfitting.)

Some Cons

  1. It is not always easy to implement. What you can get away with depends on the problem you are looking at: if you are in a machine learning setting and doing a variational approximation for inference, you can often do approximate EB, but if you are doing MCMC it can be quite difficult to implement in a computationally attractive way. Under MCMC you can try to fake things with a stochastic search algorithm such as Monte Carlo EM (see the first sketch after this list), but as far as I know the theory behind this hasn't really been worked out.

  2. By plugging in the fixed point estimate $\hat \eta$ in place of $\eta$ and drawing inferences from $\pi(\theta \mid Y, \hat \eta)$ as though we had specified $\hat \eta$ from the beginning, we are neglecting our uncertainty about $\eta$ (the second sketch after this list illustrates how this understates posterior uncertainty). There are ways to try to fix this, but mostly people just hope that it doesn't make a big difference. But if it really didn't make a difference, why not just put a hyperprior on $\eta$? This is especially suspect because the amount of information in the data about $\eta$ is often quite small.

  3. It isn't clear what exactly we are doing from a statistical perspective. It isn't really Bayesian; at best, it is an approximation to Bayesian analysis. Hypothetically, if there were a prior on $\eta$ that was tightly concentrated, then EB would be an approximation to fully Bayesian inference, but this typically isn't the case. So what the heck is this procedure doing? It seems to me that if I'm using this I'm usually either being a fake Bayesian or I have some reason to believe that the frequentist properties of the method are good. The principled Bayesian approach would be to put a prior on $\eta$, and this can work better in practice.
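
Here is the Monte Carlo EM idea from Con 1, again in the toy Normal-Normal model from above. Here the "E-step draws" are exact by conjugacy; in a model where EB is actually hard they would come from an MCMC sampler, and that is where the theoretical gaps live.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0
y = rng.normal(rng.normal(0.0, 2.0, size=200), np.sqrt(sigma2))  # synthetic data

tau2 = 1.0  # starting guess for the hyperparameter
for _ in range(50):
    # E-step: draw theta | y, tau2. Exact Normal here by conjugacy; in a
    # harder model these draws would come from MCMC.
    shrink = tau2 / (tau2 + sigma2)
    theta_draws = rng.normal(shrink * y, np.sqrt(shrink * sigma2),
                             size=(100, y.size))
    # M-step: maximize E[log pi(theta | tau2)]; for a N(0, tau2) prior this
    # sets tau2 to the average of the sampled theta_i^2.
    tau2 = np.mean(theta_draws**2)

print(f"Monte Carlo EM estimate of tau2: {tau2:.3f}")
```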
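
And for Con 2, a sketch of what the plug-in costs you. In the same toy model, compare the plug-in posterior for $\theta_1$ against a fully Bayesian answer that integrates over $\tau^2$ under a crude, hypothetical flat hyperprior on $\log \tau^2$ over a bounded grid; with this little data the full posterior is typically noticeably wider.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma2 = 1.0
n = 10  # small n: little information about eta, so the plug-in hurts most
y = rng.normal(rng.normal(0.0, 1.0, size=n), np.sqrt(sigma2))

# Grid over tau2, flat hyperprior on log(tau2) over a bounded range.
tau2_grid = np.exp(np.linspace(-6.0, 6.0, 2000))
log_marg = np.array([norm.logpdf(y, 0.0, np.sqrt(sigma2 + t)).sum()
                     for t in tau2_grid])
w = np.exp(log_marg - log_marg.max())
w /= w.sum()  # posterior weights over tau2 under the flat log-scale prior

# Conditional posterior moments of theta_1 given each tau2 (conjugacy):
shrink = tau2_grid / (tau2_grid + sigma2)
m1 = shrink * y[0]      # E[theta_1 | y, tau2]
v1 = shrink * sigma2    # Var[theta_1 | y, tau2]

# Full Bayes: mixture over tau2 (law of total variance).
fb_mean = np.sum(w * m1)
fb_var = np.sum(w * (v1 + m1**2)) - fb_mean**2

# Empirical Bayes: plug in the single maximizing tau2.
i_hat = np.argmax(log_marg)
eb_var = v1[i_hat]

print(f"posterior sd of theta_1: EB plug-in {np.sqrt(eb_var):.3f}, "
      f"full Bayes {np.sqrt(fb_var):.3f}")
```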

Hope that helps. I actually like EB quite a bit as a method for finding procedures and evaluating them according to their frequentist properties when I'm wearing my statistics hat. It gives frequentists a nice tool for constructing methods with "sharing of information" in hierarchical models. Occasionally the properties of EB estimators are provably good (e.g. the Stein shrinkage estimator can be derived from an EB standpoint, as sketched below). In machine learning, of course, you often don't really care where procedures come from and just use whatever works.
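
To make the Stein connection concrete: take $Y_i \mid \theta_i \sim N(\theta_i, 1)$ independently with $\theta_i \sim N(0, \tau^2)$ for $i = 1, \dots, p$. Marginally $Y_i \sim N(0, 1 + \tau^2)$, and the posterior mean is $$ E[\theta_i \mid Y, \tau^2] = \left(1 - \frac{1}{1 + \tau^2}\right) Y_i. $$ Since $\|Y\|^2 / (1 + \tau^2) \sim \chi^2_p$, we have $E\left[\frac{p-2}{\|Y\|^2}\right] = \frac{1}{1 + \tau^2}$ for $p > 2$, so substituting this unbiased estimate of the shrinkage factor into the posterior mean gives $$ \hat\theta_i = \left(1 - \frac{p - 2}{\|Y\|^2}\right) Y_i, $$ which is exactly the James–Stein estimator.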