Solved – the “effective sample size” of the prior in Bayesian statistics

bayesian, prior

In Bayesian statistics, what is the mathematical definition of the "effective sample size" of the prior? Could you state what the "effective sample size" is for the well-known classes of conjugate priors? Does this concept generalize to non-conjugate models? Why is the idea of the "effective sample size" of the prior important?

Edit for the bounty: this question is very important and deserves a complete, canonical answer, with more examples and proper explanations addressing the questions listed above.

Best Answer

Here is an example with a beta prior distribution and a binomial likelihood.

Suppose the prior distribution of the heads probability $\theta$ of a coin is $\mathsf{Beta}(10,10)$ and that $n = 100$ tosses of the coin yield $x = 47$ Heads. Then the posterior distribution of $\theta$ is $\mathsf{Beta}(10 + x, 10 + 100 - x) \equiv \mathsf{Beta}(57, 63).$

This results from Bayes' Theorem, multiplying prior $f(\theta)$ by likelihood $g(x|\theta)$ to get posterior $h(\theta|x):$

$$h(\theta|x) \propto f(\theta)\times g(x|\theta) \propto \theta^{10-1}(1-\theta)^{10-1} \times \theta^{x}(1-\theta)^{n-x} = \theta^{(10+x)-1}(1-\theta)^{(10+100-x)-1} = \theta^{57-1}(1-\theta)^{63-1}.$$
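As a quick check of this arithmetic, here is a minimal sketch of the conjugate update in Python (assuming SciPy is available; the variable names are only illustrative and not part of the original answer):

```python
from scipy import stats

a0, b0 = 10, 10          # Beta(10, 10) prior
n, x = 100, 47           # 47 heads observed in 100 tosses

# Conjugate update: posterior is Beta(a0 + x, b0 + n - x) = Beta(57, 63)
a1, b1 = a0 + x, b0 + n - x
posterior = stats.beta(a1, b1)

print((a1, b1))                  # (57, 63)
print(posterior.mean())          # 0.475
print(posterior.interval(0.95))  # central 95% posterior interval for theta
```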

One could say that the prior distribution is 'effectively' equivalent to advance knowledge of $20$ tosses of the coin yielding $10$ heads; in this sense, the effective sample size of the $\mathsf{Beta}(10,10)$ prior is $10 + 10 = 20.$
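This reading can be checked against the posterior mean, which (a standard property of the beta–binomial model, spelled out here with the numbers above rather than taken from the original answer) is a weighted average of the prior mean and the sample proportion, with weights equal to the prior effective sample size $20$ and the actual sample size $n$:

$$E(\theta|x) = \frac{10 + x}{20 + n} = \frac{20}{20+n}\cdot\frac{10}{20} + \frac{n}{20+n}\cdot\frac{x}{n} = \frac{20}{120}(0.50) + \frac{100}{120}(0.47) = 0.475,$$

so the prior pulls the estimate toward $0.5$ exactly as $20$ extra tosses with $10$ heads would.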

Note: In the displayed relationship for Bayes' Theorem, use of the symbol $\propto$ (read "proportional to"), instead of $=,$ recognizes that we are showing the kernels (density functions without their normalizing constants) of the prior, likelihood, and posterior. Because the prior and likelihood are 'conjugate' (the beta prior and binomial likelihood combine to give a posterior in the same beta family), we can recognize the expression on the right as the kernel of $\mathsf{Beta}(57, 63).$
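To see the kernel argument numerically, one can multiply the prior density by the binomial likelihood on a grid, renormalize, and compare with the exact $\mathsf{Beta}(57, 63)$ density. This is only a sketch assuming NumPy/SciPy, not part of the original answer:

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)   # grid over (0, 1)

# Product of prior density and binomial likelihood (unnormalized posterior kernel)
unnorm = stats.beta(10, 10).pdf(theta) * stats.binom(100, theta).pmf(47)

# Renormalize on the grid (Riemann sum) ...
grid_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# ... and compare with the exact Beta(57, 63) density: they agree up to grid error
print(np.max(np.abs(grid_post - stats.beta(57, 63).pdf(theta))))
```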