Consistency of Dirichlet Distribution

Tags: measure-theory, probability, probability-distributions

Let $Y_{i}=(X_{1i}, X_{2i},…, X_{Ki})\sim \operatorname{Multinomial}(N,(p_{1}, p_{2},…, p_{K})), \ \ i=1, 2,…, n,$ be iid random draws from a $\operatorname{Multinomial}$ distribution, i.e. $\sum_{k=1}^{K}X_{ki}=N$ and $\sum_{k=1}^{K}p_{k}=1$.

If we want to perform $\textit{Bayesian inference}$, a convenient prior choice for the probabilities $p_{k}, \ \ k=1, 2,…, K$, that we want to infer, is a $\textit{Dirichlet}$ distribution.

$p=(p_{1}, p_{2},…, p_{K})\sim \operatorname{Dirichlet}(a_{1}, a_{2},…, a_{K})$

The $\textit{Dirichlet}$ prior is conjugate to the $\textit{Multinomial}$ likelihood, hence the posterior distribution will again be a $\textit{Dirichlet}$:

$p=(p_{1}, p_{2},…, p_{K})|(Y_{1}, Y_{2},…, Y_{n})\sim \operatorname{Dirichlet}(a_{1} + \sum_{i=1}^{n}X_{1i}, a_{2}+\sum_{i=1}^{n}X_{2i},…, a_{K}+\sum_{i=1}^{n}X_{Ki})$
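
For concreteness, here is a minimal numerical sketch of this conjugate update (using numpy; the values of $K$, $N$, $n$, the prior parameters and the true $p$ below are just illustrative):

```python
# Minimal sketch of the Dirichlet-Multinomial conjugate update (illustrative values).
import numpy as np

rng = np.random.default_rng(0)

K, N, n = 3, 10, 500                    # categories, trials per draw, number of draws
p_true = np.array([0.2, 0.3, 0.5])      # "true" probability vector used to simulate data
a = np.ones(K)                          # Dirichlet prior parameters a_1, ..., a_K

Y = rng.multinomial(N, p_true, size=n)  # n iid Multinomial(N, p) count vectors, shape (n, K)
a_post = a + Y.sum(axis=0)              # posterior parameters: a_k + sum_i X_{ki}

print(a_post / a_post.sum())            # posterior mean; close to p_true for large n
```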


A definition of posterior consistency that I found in this question https://stats.stackexchange.com/questions/320104/proof-of-posterior-consistency says

“given the data $x_1,…,x_n$ iid and the model $f(x;\theta)$, if we have posterior consistency at $\theta_0$, then
$\forall \epsilon > 0$,
$\lim_{n\rightarrow +\infty}\pi(|\theta-\theta_0|>\epsilon|X)=0$.”


My goal is to prove that this type of consistency holds for the Bayesian model described earlier, i.e. I assume I have to prove consistency at the true value $p^{0} = (p^{0}_{1}, p^{0}_{2},…, p^{0}_{K})$.

For that I have to show that
$$\lim_{n\rightarrow \infty}\pi(\left \| p-p^{0} \right \|\geq \epsilon\mid Y_{1}, Y_{2},…, Y_{n}) =0. $$


I have to start by specifying a norm; suppose that I use the $\textit{Euclidean}$ norm (we could also use the $L_{1}$ norm), i.e. $\left \| p-p^{0} \right \| = \sqrt{\sum_{k=1}^{K}(p_{k}-p_{k}^{0})^{2}}$, so the quantity of interest is

$$\pi(\sqrt{\sum_{k=1}^{K}(p_{k}-p_{k}^{0})^{2}}\geq \epsilon\mid Y_{1}, Y_{2},…, Y_{n})$$

How do I move forward with proving that?
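
Before attempting a proof, a quick Monte Carlo check (not a proof, and with illustrative constants) suggests that this posterior probability does indeed vanish as $n$ grows: simulate data from a fixed $p^{0}$, form the Dirichlet posterior, and estimate the posterior mass outside the $\epsilon$-ball by sampling.

```python
# Monte Carlo estimate of pi(||p - p0|| >= eps | Y_1,...,Y_n) for increasing n (not a proof).
import numpy as np

rng = np.random.default_rng(1)

K, N, eps = 3, 10, 0.05
p0 = np.array([0.2, 0.3, 0.5])              # true parameter p^0 (illustrative)
a = np.ones(K)                              # Dirichlet prior parameters

for n in [10, 100, 1000, 10_000]:
    Y = rng.multinomial(N, p0, size=n)      # n iid Multinomial(N, p^0) draws
    a_post = a + Y.sum(axis=0)              # conjugate Dirichlet posterior parameters
    draws = rng.dirichlet(a_post, size=100_000)
    mass = np.mean(np.linalg.norm(draws - p0, axis=1) >= eps)
    print(n, mass)                          # posterior mass outside the eps-ball shrinks to 0
```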


$\textbf{Update:}$ I tried to examine what happens when $K=2$.

$$\pi(\sqrt{\sum_{k=1}^{2}(p_{k}-p_{k}^{0})^{2}}\geq \epsilon\mid Y_{1}, Y_{2},…, Y_{n}) = \pi(\sqrt{(p_{1}-p_{1}^{0})^{2}+(p_{2}-p_{2}^{0})^{2}} \geq \epsilon |Y_{1}, Y_{2},…, Y_{n})$$

$$=\pi(\sqrt{(p_{1}-p_{1}^{0})^{2}+(1-p_{1}-1+p_{1}^{0})^{2}} \geq \epsilon |Y_{1}, Y_{2},…, Y_{n}) = \pi(\sqrt{2(p_{1}-p_{1}^{0})^{2}} \geq \epsilon |Y_{1}, Y_{2},…, Y_{n}) = \pi(\sqrt{2}\left | p_{1}-p_{1}^{0} \right | \geq \epsilon |Y_{1}, Y_{2},…, Y_{n})$$

Then the posterior mean and variance of $p_{1}$ are

$$M = \mathbb{E}[p_{1}|Y_{1}, Y_{2},…, Y_{n}] = \frac{\sum_{k=1}^{2}a_{k}}{\sum_{k=1}^{2}a_{k}+nN}(\frac{a_{1}}{\sum_{k=1}^{2}a_{k}}) + \frac{nN}{\sum_{k=1}^{2}a_{k}+nN}(\frac{\sum_{i=1}^{n}X_{1i}}{nN})$$

$$V = \frac{(a_{1}+\sum_{i=1}^{n}X_{1i})(a_{1}+a_{2}+nN-a_{1}-\sum_{i=1}^{n}X_{1i})}{(a_{1}+a_{2}+nN)^{2}(a_{1}+a_{2}+nN+1)}\rightarrow 0$$
as $n\rightarrow \infty$. Following the argument used in this post https://stats.stackexchange.com/questions/320104/proof-of-posterior-consistency, we have

$$\pi(\left | p_{1}-p_{1}^{0} \right | \geq \epsilon/\sqrt{2} |Y_{1}, Y_{2},…, Y_{n}) \leq \pi(\left | p_{1}-M \right | \geq (\epsilon/2\sqrt{2}) |Y_{1}, Y_{2},…, Y_{n}) + \pi(\left | M-p_{1}^{0} \right | \geq (\epsilon/2\sqrt{2}) |Y_{1}, Y_{2},…, Y_{n})$$

In the RHS, the first term goes to zero by Markov's inequality applied to $(p_{1}-M)^{2}$, since $V\rightarrow 0$, and the second term also goes to zero because $M$ is a consistent estimator of $p_{1}^{0}$.
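
For $K=2$ this can also be checked numerically in closed form, since the posterior of $p_{1}$ is $\operatorname{Beta}(a_{1}+\sum_{i}X_{1i},\ a_{2}+nN-\sum_{i}X_{1i})$. Here is a sketch (illustrative constants, using scipy's Beta distribution) that computes $M$, $V$ and the exact tail probability $\pi(|p_{1}-p_{1}^{0}|\geq \epsilon/\sqrt{2}\mid Y_{1},…, Y_{n})$ for increasing $n$:

```python
# K = 2: posterior mean M, variance V, and the exact tail probability via the Beta CDF.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)

N, eps = 10, 0.05
p1_true = 0.3                                       # true p_1^0 (illustrative)
a1, a2 = 1.0, 1.0                                   # prior: p_1 ~ Beta(a1, a2)

for n in [10, 100, 1000, 10_000]:
    X1 = rng.binomial(N, p1_true, size=n).sum()     # sum_i X_{1i}
    al, be = a1 + X1, a2 + n * N - X1               # posterior Beta parameters
    M = al / (al + be)                              # posterior mean
    V = al * be / ((al + be) ** 2 * (al + be + 1))  # posterior variance
    t = eps / np.sqrt(2)
    tail = beta.cdf(p1_true - t, al, be) + beta.sf(p1_true + t, al, be)
    print(n, M, V, tail)                            # M -> p_1^0, V -> 0, tail -> 0
```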

However, how can we extend this argument to $K>2$?


$\textbf{Thought:}$ The $\textit{Dirichlet}$ distribution is a generalization of the $\textit{Beta}$ distribution, and the mean of each component $p_{k}$ under the $\textit{Dirichlet}$ is the same as the mean of the marginal distribution of $p_{k}$, which is a $\textit{Beta}$ distribution.

Would it then be sufficient to check consistency for each $p_{k}$ separately, based on its marginalized $\textit{Beta}$ distribution, i.e. to check

$$\pi_{Beta}(\left | p_{k}-p_{k}^{0} \right | \geq \epsilon |Y_{1}, Y_{2},…, Y_{n})$$

where $\pi_{Beta}$ is the marginal of the $\textit{Dirichlet}$ posterior.
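
If this works, a union bound would reduce the joint statement to the marginals (assuming the Euclidean norm): if $\sum_{k=1}^{K}(p_{k}-p_{k}^{0})^{2}\geq \epsilon^{2}$, then at least one term must be $\geq \epsilon^{2}/K$, so

$$\pi(\left \| p-p^{0} \right \|\geq \epsilon\mid Y_{1}, Y_{2},…, Y_{n})\leq \sum_{k=1}^{K}\pi_{Beta}\left(\left | p_{k}-p_{k}^{0} \right |\geq \frac{\epsilon}{\sqrt{K}}\ \Big|\ Y_{1}, Y_{2},…, Y_{n}\right),$$

and each term on the right could be handled exactly as in the $K=2$ case above, since each marginal posterior is $\operatorname{Beta}(a_{k}+\sum_{i}X_{ki},\ \sum_{j\neq k}a_{j}+nN-\sum_{i}X_{ki})$.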

Best Answer

This holds true for every prior $\pi$, not just a Dirichlet prior. Partition the simplex into sets of diameter $\epsilon/3$. The true parameter $\theta=\theta_0$ lies in one of them, call it $Q$, and $\pi(Q)>0$. For $a>0$, let $Q_a$ consist of all points within distance at most $a\epsilon/3$ from $Q$. Then the posterior probability of $Q_{2}^c$ tends to zero a.s. on the event that the true parameter is in $Q$. Since every point of $Q_{2}$ is within $\epsilon$ of $\theta_0$, the event $\{|\theta-\theta_0|>\epsilon\}$ is contained in $Q_{2}^c$, so this is exactly posterior consistency at $\theta_0$.

Proof: (Edit: Adding some detail to the previous sketch.) Write $P(\cdot)=\int P_\theta(\cdot) \,d\pi(\theta)$, with $E$ the corresponding expectation operator. Define $$A_n=\{Y_n/n \in Q_1 \}\,$$ and let $$f_n(Y)=P(\theta \in Q_2^c |Y_1,...,Y_n)$$ be the posterior probability of $Q_2^c$ given the first $n$ samples. Then $$E[f_n(Y) 1_{A_n}]=P[\{\theta \in Q_2^c\} \cap A_n]= \int_{Q_2^c} P_\theta(A_n ) \, d\pi(\theta)$$ tends to zero exponentially fast by Hoeffding's inequality. Thus $$ E[\sum_n f_n(Y) 1_{A_n}] <\infty \,,$$ whence $f_n(Y)1_{A_n} \to 0$ almost surely with respect to $P$. But we also have $1_{A_n^c} \to 0$ a.s. with respect to $P( \cdot| \theta \in Q)$ by the strong law of large numbers. We conclude that $$P(f_n(Y) \to 0 | \theta \in Q) =1 \,.$$
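
A sketch of how the Hoeffding step can be made explicit, reading $Y_n/n$ as the vector of empirical frequencies $\bar Y_n=\frac{1}{nN}\sum_{i=1}^{n}Y_{i}$ (this reading is my assumption): for $\theta\in Q_2^c$ we have $\operatorname{dist}(\theta,Q_1)>\epsilon/3$, so

$$P_\theta(A_n)\leq P_\theta\left(\lVert \bar Y_n-\theta\rVert>\tfrac{\epsilon}{3}\right)\leq \sum_{k=1}^{K}P_\theta\left(\left|\bar Y_{n,k}-\theta_k\right|>\tfrac{\epsilon}{3\sqrt K}\right)\leq 2K\exp\left(-\tfrac{2nN\epsilon^{2}}{9K}\right),$$

using Hoeffding's inequality coordinatewise (under $P_\theta$, $\sum_{i}X_{ki}\sim\operatorname{Binomial}(nN,\theta_k)$). The bound is uniform in $\theta\in Q_2^c$, so the integral over $Q_2^c$ also decays exponentially.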
