Solved – Understanding convergence in Bayesian inference of coin tossing

Tags: bayesian, convergence, inference

When we are uncertain about the probability of heads, $p_H$, in a coin-tossing experiment, we often model it using a Beta prior as follows:
$$p_H\sim \text{Beta}(a_0,b_0),$$
for some parameters $a_0,b_0$.

When we toss the coin $N$ times and observe $N_H$ heads and $N_T$ tails, the posterior is $$p_H\sim \text{Beta}(a_0+N_H,b_0+N_T).$$
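As a sanity check, the conjugate update above is easy to carry out numerically. The following is a minimal sketch; the prior parameters, random seed, and assumed head probability of 0.7 are illustrative choices, not values from the question, and the variance uses the standard Beta formula:

```python
import random

def beta_posterior(a0, b0, flips):
    """Conjugate Beta-Bernoulli update: a Beta(a0, b0) prior plus
    Bernoulli data gives a Beta(a0 + N_H, b0 + N_T) posterior."""
    n_heads = sum(flips)
    n_tails = len(flips) - n_heads
    a, b = a0 + n_heads, b0 + n_tails
    post_mean = a / (a + b)                          # (a0 + N_H) / (a0 + b0 + N)
    post_var = a * b / ((a + b) ** 2 * (a + b + 1))  # standard Beta variance
    return a, b, post_mean, post_var

# Simulate 1000 tosses of a coin with an assumed head probability of 0.7.
random.seed(0)
true_p = 0.7
flips = [1 if random.random() < true_p else 0 for _ in range(1000)]

a, b, post_mean, post_var = beta_posterior(1.0, 1.0, flips)
```

With a uniform Beta(1, 1) prior and 1000 tosses, the posterior mean lands close to the simulated head probability and the posterior variance is already small.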

So, the mean value of $p_H$ is $\frac{a_0+N_H}{a_0+N_H+b_0+N_T}$ with a variance $\frac{(a_0+N_H)(a_0+N_H)}{N^2(N+1)}.$

My question is this:

Can we say the posterior distribution converges to the "true" distribution, when the true distribution (or the true scalar value of $p_H$) is never going to be observed?

Best Answer

Yes, it does converge to the "true distribution" (suitably defined)

First of all, it is worth noting that it is a little strange to refer to the "true distribution" of the parameter as something aside from the prior and posterior. If you proceed under the operational Bayesian approach, then the parameter has an operational definition as a function of the sequence of observable values, so it is legitimate to refer to a "true value" of the parameter (see esp. Bernardo and Smith 2000). The standard convergence theorems in Bayesian statistics show that the posterior converges weakly to this true parameter value, defined operationally through the law of large numbers. If I understand your intention correctly, the "true distribution" of the parameter would essentially just be a point-mass distribution on this true value. If that is what you mean, then yes, the posterior will converge (weakly) to it; that is really just a restatement of the standard convergence theorems in Bayesian statistics.

Taking $\mathbf{X} = (X_1, X_2, X_3, ...)$ to be the sequence of observable coin-toss outcomes, we define $p_H$ operationally as a function of $\mathbf{X}$. In Bayesian analysis with IID data the most operational definition of the parameters is an index to the limiting empirical distribution of the observable sequence (see O'Neill 2009 for discussion and details). Now, the beta posterior you are referring to arises in the IID model:

$$X_1,X_2,X_3,... | p_H \sim \text{IID Bern}(p_H).$$

The parameter $p_H$ can be given an operational definition as the limit of the sample mean of the observable coin-tosses (see O'Neill 2009 again). To facilitate our analysis, we will use the notation $\hat{p}_H \equiv \lim_{N \rightarrow \infty} N_H/N$ to denote this limit, so the "true value" of the parameter $p_H$ is the point $\hat{p}_H$. (In other words, $p_H$ is $\hat{p}_H$, but we will use two separate referents to elucidate the convergence.)

Your posterior distribution does indeed converge to the "true distribution" of $p_H$, which is a point-mass distribution on $\hat{p}_H$. To see this, we first derive the asymptotic mean and variance$^\dagger$ of the posterior:

$$\begin{equation} \begin{aligned} \lim_{N \rightarrow \infty} \mathbb{E}(p_H| \mathbf{X}_N) &= \lim_{N \rightarrow \infty} \frac{a_0+N_H}{a_0 + b_0 + N} \\[6pt] &= \lim_{N \rightarrow \infty} \frac{N_H}{N} \cdot \frac{a_0+N_H}{N_H} \Bigg/ \frac{a_0+b_0+N}{N} \\[6pt] &\overset{a.s}{=} \lim_{N \rightarrow \infty} \frac{N_H}{N} = \hat{p}_H. \\[6pt] \lim_{N \rightarrow \infty} \mathbb{V}(p_H| \mathbf{X}_N) &= \lim_{N \rightarrow \infty} \frac{(a_0+N_H)(b_0+N_T)}{(a_0 + b_0 + N)^2(a_0 + b_0 + N+1)} \\[6pt] &= \lim_{N \rightarrow \infty} \frac{a_0+N_H}{a_0 + b_0 + N} \cdot \frac{b_0+N_T}{a_0 + b_0 + N} \cdot \frac{1}{a_0 + b_0 + N + 1} \\[6pt] &\leqslant \lim_{N \rightarrow \infty} \frac{1}{a_0 + b_0 + N + 1} \\[6pt] &= 0. \\[6pt] \end{aligned} \end{equation}$$

So, we have $\mathbb{E}(p_H| \mathbf{X}_N) \overset{a.s}{\rightarrow} \hat{p}_H$ and $\mathbb{V}(p_H| \mathbf{X}_N) \rightarrow 0$, which gives convergence in mean-square to the true parameter value $\hat{p}_H$. Using Markov's inequality this implies convergence in probability to $\hat{p}_H$, which further implies convergence in probability of the posterior to the point-mass distribution on $\hat{p}_H$. This means that we have weak convergence to the "true distribution" of the parameter.
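The convergence derived above can also be seen in a small simulation. This is a rough sketch under assumed values (a "true" parameter of 0.3, a Beta(2, 2) prior, and a few arbitrary checkpoints), not part of the original derivation:

```python
import random

random.seed(1)
p_true = 0.3        # assumed operational "true value" of p_H
a0, b0 = 2.0, 2.0   # arbitrary prior hyperparameters

n_heads = 0
checkpoints = {100, 10_000, 100_000}
results = {}
for n in range(1, 100_001):
    n_heads += random.random() < p_true
    if n in checkpoints:
        a, b = a0 + n_heads, b0 + (n - n_heads)
        post_mean = a / (a + b)
        post_var = a * b / ((a + b) ** 2 * (a + b + 1))
        results[n] = (post_mean, post_var)

# As N grows, the posterior mean approaches p_true and the posterior
# variance shrinks toward zero, consistent with weak convergence of
# the posterior to a point mass at the true value.
```

Printing `results` at the three checkpoints shows the posterior mean settling near 0.3 while the variance drops by roughly a factor of 100 per checkpoint, mirroring the $O(1/N)$ rate in the variance bound above.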


$^\dagger$ Your stated posterior variance is incorrect; I have used the correct posterior variance in my working.