Why are weakly informative priors a good idea?

bayesian, prior, uninformative-prior

There are many proposed solutions to the problem that, typically, not enough information is available to fully specify a prior.

For all approaches (that I know of) except weakly informative priors, I roughly understand why they are reasonable solutions to the problem: "There is not enough information to specify a precise prior." For weakly informative priors this connection escapes me. To give an example, let's say I want to estimate the mean of a Normal distribution but have absolutely no prior information. Using a typical weakly informative prior $\mathcal{N}(0,\sigma^2)$ with $\sigma$ large, I do not see how this is an implementation of no prior knowledge. Instead, it roughly corresponds to any value being equally likely, which is completely different from complete uncertainty about the prior, as is for example achieved by minimax methods.
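To make this concrete, here is a minimal sketch (Python, assuming a known data variance and made-up data; the numbers are only illustrative) of the standard conjugate normal-normal update, showing that with a large $\sigma$ the prior barely moves the posterior away from the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: n observations from a Normal with known variance sigma2
sigma2 = 4.0                                   # known data variance (illustrative assumption)
x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=20)
n, xbar = len(x), x.mean()

def posterior_mean_var(mu0, tau0_sq):
    """Conjugate normal-normal update for the mean, with known data variance."""
    post_prec = 1.0 / tau0_sq + n / sigma2     # posterior precision
    post_mean = (mu0 / tau0_sq + n * xbar / sigma2) / post_prec
    return post_mean, 1.0 / post_prec

# weakly informative prior N(0, 100^2): posterior is essentially (xbar, sigma2/n)
print(posterior_mean_var(mu0=0.0, tau0_sq=100.0**2))
# strongly informative prior N(0, 0.5^2): posterior mean is pulled noticeably towards 0
print(posterior_mean_var(mu0=0.0, tau0_sq=0.5**2))
```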

Even in the other situation where weakly informative priors $\mathcal{N}(0,\sigma^2)$ seem to be often used, namely for implementing the prior knowledge that deviations from $0$ become increasingly unlikely, weakly informative priors seem too optimistic in the sense that they assume a fixed $\sigma$. Alternatives like estimating $\sigma$ (Empirical Bayes), putting a hyper-prior on $\sigma$ (Hierarchical Bayes), or considering all $\sigma$s as possibilities without assigning each $\sigma$ a probability (Gamma-Minimax) all seem better suited.
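To illustrate what I mean by the hierarchical alternative, here is a small sketch (my own illustration, using an arbitrary half-Cauchy hyper-prior and plain Monte Carlo) showing that integrating $\sigma$ out gives a marginal prior for the mean with much heavier tails than any single fixed-$\sigma$ normal:

```python
import numpy as np
from scipy import stats

# hyper-prior on sigma: half-Cauchy with scale 5 (an arbitrary choice for illustration)
sigma_draws = stats.halfcauchy(scale=5).rvs(size=100_000, random_state=1)

def marginal_prior_density(mu):
    """Monte Carlo estimate of p(mu) = integral of N(mu | 0, sigma^2) over the hyper-prior on sigma."""
    return stats.norm.pdf(mu, loc=0, scale=sigma_draws).mean()

for mu in (0.0, 10.0, 50.0):
    fixed = stats.norm.pdf(mu, loc=0, scale=5)   # fixed-sigma prior N(0, 5^2)
    hierarchical = marginal_prior_density(mu)    # sigma integrated out
    print(f"mu = {mu:5.1f}   fixed sigma: {fixed:.2e}   hierarchical: {hierarchical:.2e}")
```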

At the same time, weakly informative priors are recommended by many experts (for example, in https://www.amazon.de/Bayesian-Analysis-Chapman-Statistical-Science/dp/1439840954) as a good solution to the problem that there is not enough information to specify a precise prior.

So, what am I missing? I have looked in https://www.amazon.de/Bayesian-Analysis-Chapman-Statistical-Science/dp/1439840954 but the arguments presented there are more in terms of favorable properties (it regularizes, it is weaker than the actual prior, it lets the likelihood speak) and not in terms of why this approach is a good way to implement uncertainty in the prior and why it should typically be preferred over the alternative approaches.

Best Answer

I do not see how this is an implementation of no prior knowledge. Instead, it roughly corresponds to any value being equally likely, which is completely different from complete uncertainty about the prior, as is for example achieved by minimax methods.

It appears that you are confusing prior ignorance about the parameter value with prior ignorance about the prior distribution itself. The latter is not a requirement of Bayesian analysis. Clearly, you cannot specify the mathematical form of a prior distribution and also claim ignorance about that distribution --- any specification of a prior distribution constitutes perfect knowledge of the prior. So, the goal here is to specify a prior that captures ignorance about the parameter value. A weakly informative prior has the following general benefits:

  • It represents genuine prior ignorance: A weakly informative prior gives a reasonable representation of genuine ignorance about the parameter. For example, if you use an improper uniform prior for a mean parameter (over all the real numbers) then every value has equal density. This representation comes from the principle of insufficient reason (formulated by Jacob Bernoulli, but more commonly associated with Laplace).$^ \dagger$ Uniformity of distribution on an appropriate measurement scale means that the prior does not strongly favour particular values of the parameter.

  • It does not contribute strongly to the posterior: The prior and likelihood functions both contribute to the posterior. There are various techniques to measure the contribution of each of these functions. For example, when using a conjugate prior, the contribution of the prior can be measured as a number of pseudo data points. With a weakly informative prior the number of pseudo data points in the prior is low (usually one or less; see the sketch after this list). In such cases we sometimes say that this small contribution from the prior "lets the data speak for itself".

  • It allows us to make objective inferences: In objective Bayesian analysis we formulate a method of prior selection that leads to a unique prior (i.e., it does not have variable hyperparameters). Virtually every approach to objective Bayesian analysis formulates the prior based on some appeal to ignorance, yielding a weakly informative prior.$^ {\dagger \dagger}$
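To make the "pseudo data points" reading in the second point concrete, here is a small sketch (assuming the standard normal-normal conjugate model with known data variance; the observations are invented for illustration) in which the prior $\mathcal{N}(\mu_0, \sigma^2/\kappa_0)$ acts exactly like $\kappa_0$ extra observations equal to $\mu_0$:

```python
import numpy as np

sigma2 = 1.0                         # known data variance (assumption for the sketch)
x = np.array([2.1, 1.8, 2.4, 2.0])   # invented observations
n, xbar = len(x), x.mean()

def posterior_mean(mu0, kappa0):
    """Posterior mean under the prior N(mu0, sigma2 / kappa0): kappa0 pseudo-observations at mu0."""
    return (kappa0 * mu0 + n * xbar) / (kappa0 + n)

print(posterior_mean(mu0=0.0, kappa0=0.5))   # weakly informative: half a pseudo data point
print(posterior_mean(mu0=0.0, kappa0=10.0))  # informative: ten pseudo data points at 0
```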

Of course, you should bear in mind that all of these arguments apply to a situation where we want to avoid adding information about the parameter into the prior. If we have genuine prior information that we want to incorporate into the analysis then we will generally want to eschew a weakly informative prior in favour of one that captures that information.


$^ \dagger$ Note that when the "principle of insufficient reason" is applied to continuous random variables, the variable should be on a proper scale where uniformity is an appropriate representation. A nonlinear transform of a uniform random variable is non-uniformly distributed, which means that one must decide which representation of the variable is uniform. For a mean parameter we generally take this to be uniform on its initial scale, but for a variance parameter we usually take this to be uniform on its log-scale (i.e., after a logarithmic transformation).
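As a standard change-of-variables illustration (not specific to any particular text): if $\phi = \log \sigma^2$ is assigned a uniform (improper) density, then transforming back gives $p(\sigma^2) = p(\phi)\left|\frac{d\phi}{d\sigma^2}\right| \propto \frac{1}{\sigma^2}$, the usual noninformative prior for a variance; a prior that is uniform directly on $\sigma^2$ would instead correspond to a non-uniform prior on $\log \sigma^2$.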

$^ {\dagger \dagger}$ Note that there are different theories here of what prior is appropriate (e.g., the Jeffreys prior, the Jaynes maximum-entropy prior), so there are still multiple competing priors at a theoretical level. However, once you subscribe to a particular theory, you can then formulate how ignorance is represented objectively in particular cases. (In any case, most of the competing theories use very similar priors, so there is usually very little difference in the posterior under any of these theories.)
