Solved – Maximum a posteriori on Multinomial distribution with a Dirichlet prior can result in negative probabilities

dirichlet distribution, maximum likelihood, multinomial-distribution, optimization, posterior

I am doing a maximum a posteriori (MAP) estimation of a Multinomial distribution $M(c_1,\dots,c_n|p_1,\dots,p_n)$ with a Dirichlet prior $D(p_1,\dots,p_n|\alpha_1,\dots,\alpha_n)$. The experimental counts for the MAP estimate are $(c_1,\dots,c_n)$.

My understanding is that MAP is equivalent to $\text{argmax}(M(\vec{c}|\vec{p})D(\vec{p}|\vec{\alpha}))$ over $\vec{p}$ for fixed experimental data $\vec{c}$ and a fixed prior $\vec{\alpha}$. The solution seems to be

$p_i = \frac{c_i+\alpha_i-1}{\sum_{k=1}^{n}(c_k+\alpha_k-1)}$.

However, this can be negative, because the naive solution using just a Lagrange multiplier does not impose the $p_i>0$ constraints. For instance, for a category $i$ with zero counts $c_i=0$ and a prior $\alpha_i=0.5$, the numerator is $-0.5$, so $p_i<0$ whenever the denominator is positive.

Is there a known analytic solution for MAP that ensures the multinomial probabilities are never negative? Do I need to do it numerically instead?

Or maybe I am completely misunderstanding how the MAP is to be performed? Any suggestions or appropriate literature would be welcome.

Best Answer

Just to reiterate, you have $n$-outcome count data $\vec{c} = (c_1,...,c_n)$, and I will assume these come from a total of $N$ shots, so that $N=\sum_{k=1}^{n}c_k$. The hierarchical model you have described is then the following:

$$\begin{aligned}\vec{c}|\vec{p}&\sim \text{Mult}(N,\vec{p}),\\ \vec{p}|\vec{\alpha} &\sim \text{Dir}(\vec{\alpha}).\end{aligned}$$

Now, because the Dirichlet distribution is the conjugate prior for a multinomial likelihood, the posterior is also a Dirichlet distribution. In particular, $$\vec{p}|\vec{c},\vec{\alpha}\sim \text{Dir}(\vec{c}+\vec{\alpha}).$$ The mode of this posterior distribution is, as you correctly pointed out, $$\vec{p}_{\text{MAP}}=\frac{\vec{c}+\vec{\alpha}-1}{\sum_{k=1}^{n}(c_k+\alpha_k-1)}=\frac{\vec{c}+\vec{\alpha}-1}{N-n+\sum_{k=1}^{n}\alpha_k},$$ but this only holds when every posterior parameter satisfies $c_i+\alpha_i>1$, which is guaranteed if all $\alpha_i>1$. If your prior parameters $\vec{\alpha}$ and counts do not satisfy this, the formula no longer gives the mode (for $c_i+\alpha_i<1$ the posterior density grows without bound as $p_i\to 0$), and you will need to resort to numerics, yes.
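For reference, the closed form above follows from a short Lagrange-multiplier argument. This is only a sketch, assuming every $c_i+\alpha_i>1$ so that the maximizer lies in the interior of the simplex; $\lambda$ is a symbol introduced here for the multiplier of the constraint $\sum_k p_k=1$:

$$\begin{aligned}
\log\left[M(\vec{c}|\vec{p})\,D(\vec{p}|\vec{\alpha})\right] &= \sum_{k=1}^{n}(c_k+\alpha_k-1)\log p_k + \text{const},\\
\frac{\partial}{\partial p_i}\left[\sum_{k=1}^{n}(c_k+\alpha_k-1)\log p_k-\lambda\Big(\sum_{k=1}^{n}p_k-1\Big)\right] &= \frac{c_i+\alpha_i-1}{p_i}-\lambda=0,\\
p_i &= \frac{c_i+\alpha_i-1}{\lambda},\qquad \lambda=\sum_{k=1}^{n}(c_k+\alpha_k-1).
\end{aligned}$$

If some $c_i+\alpha_i<1$ while $\lambda>0$, this stationary point has $p_i<0$ and so lies outside the simplex, which is exactly the situation described in the question.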

However, the MAP estimate is only one particular choice of point estimate and possibly not the best for this problem. Another is the posterior mean, which for this posterior is given by $$ \vec{p}_{\text{CM}} = \frac{\vec{c}+\vec{\alpha}}{N+\sum_{k=1}^{n}\alpha_k}$$ and holds for all $\alpha_i>0$. This gives you a closed-form point estimate of your unknown $\vec{p}$ for any valid choice of prior, with every component strictly positive, although it is not the MAP.
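As a rough illustration of the two point estimates, here is a minimal Python sketch. The function names `posterior_mean` and `map_estimate` and the example counts are hypothetical, chosen only for this illustration; the MAP helper deliberately refuses to apply the closed form unless every posterior parameter exceeds 1.

```python
from typing import List, Sequence


def posterior_mean(counts: Sequence[float], alpha: Sequence[float]) -> List[float]:
    """Mean of Dir(counts + alpha); defined whenever every alpha_i > 0."""
    post = [c + a for c, a in zip(counts, alpha)]
    total = sum(post)
    return [x / total for x in post]


def map_estimate(counts: Sequence[float], alpha: Sequence[float]) -> List[float]:
    """Closed-form mode of Dir(counts + alpha).

    Raises ValueError unless every posterior parameter c_i + alpha_i exceeds 1,
    because the closed form is not the mode otherwise.
    """
    post = [c + a for c, a in zip(counts, alpha)]
    if any(x <= 1 for x in post):
        raise ValueError("closed-form mode needs every c_i + alpha_i > 1")
    total = sum(x - 1 for x in post)
    return [(x - 1) / total for x in post]


# Hypothetical counts with one empty category, as in the question.
counts = [0, 3, 7]
print(posterior_mean(counts, [0.5, 0.5, 0.5]))  # all entries strictly positive
print(map_estimate(counts, [2.0, 2.0, 2.0]))    # valid: every c_i + alpha_i > 1
```

Calling `map_estimate(counts, [0.5, 0.5, 0.5])` raises a `ValueError` in this sketch instead of returning a negative "probability" for the empty category.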

Related Solutions

Solved – Draw a multinomial distribution from a Dirichlet distribution

Let's take a step back from the Dirichlet distribution and multinomial distribution and consider a slightly simpler set of models.

The binomial distribution describes the number of "successes" $y$ one expects to observe in a number of trials $n$. The binomial model has several key properties:

- The binomial model is dichotomous.
- Each event in a binomial model has a probability of "success" $\theta$.
- The trials in a binomial model are independent: previous successes neither increase nor decrease the probability of future successes.

This is all well and good when all of our data are subject to some kind of rigorous controls, so that we know that across all of our observations, the $\theta$ for each "batch" of trials is the same. But more realistically, we have reason to believe that the $\theta$ for trial $y_i$ is different from the $\theta$ for trial $y_j$. One way to model this is to treat each $\theta$ as if it were drawn separately from a beta distribution. The beta distribution has several useful properties:

- It is a probability distribution over probabilities: that is, it has support on the unit interval.
- It is conjugate to the binomial model, which simplifies computations. (I have intentionally omitted an extended discussion of conjugacy in this post, and of how it can simplify the process of drawing values from these distributions, because I feel that is only tangentially related to your question.)

One method to draw values from a beta-binomial model is the following set of steps:

1. Draw a value $\tilde{\theta}$ from the beta distribution.
2. Draw a value $y$ from the binomial distribution with $\theta=\tilde{\theta}$.
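The two steps above can be sketched in Python using only the standard library. The function name `draw_beta_binomial`, the shape parameters `a` and `b`, and the example values are assumptions made for this illustration; the binomial draw is done by counting Bernoulli successes, which is simple rather than efficient.

```python
import random


def draw_beta_binomial(n_trials: int, a: float, b: float, rng: random.Random) -> int:
    """Draw one count y: theta ~ Beta(a, b), then y ~ Binomial(n_trials, theta)."""
    # Step 1: draw the success probability from the beta distribution.
    theta = rng.betavariate(a, b)
    # Step 2: count successes in n_trials independent Bernoulli(theta) trials.
    return sum(rng.random() < theta for _ in range(n_trials))


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
draws = [draw_beta_binomial(10, 2.0, 5.0, rng) for _ in range(5)]
print(draws)
```

Because a fresh $\tilde{\theta}$ is drawn for every call, the resulting counts are more dispersed than draws from a single binomial with fixed $\theta$.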

This covers the simple case of a dichotomous, binary outcome. But you've asked about Dirichlet distributions. Happily, the Dirichlet distribution is actually the same thing as a beta distribution when the dimension is 2. In higher dimensions, it is analogous to the beta distribution.

Likewise, the multinomial distribution is the higher-dimensional analogue of the binomial distribution. In the case of a dichotomous outcome, the multinomial distribution reduces to the binomial distribution.

Drawing values from a multinomial distribution with a Dirichlet distribution over the probabilities of outcomes is accomplished in a very similar way:

1. Draw a vector of probabilities from the Dirichlet distribution.
2. Use that vector of probabilities to draw a vector of outcomes from the multinomial distribution.
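A minimal standard-library Python sketch of these two steps follows. The helper names `draw_dirichlet` and `draw_multinomial` and the example parameters are hypothetical. The Dirichlet draw uses the standard construction of normalizing independent Gamma($\alpha_i$, 1) variates, which is one common way to sample it and not something the original answer specifies.

```python
import random
from typing import List, Sequence


def draw_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_multinomial(n_trials: int, probs: Sequence[float], rng: random.Random) -> List[int]:
    """Draw category counts for n_trials independent trials with probabilities probs."""
    counts = [0] * len(probs)
    for category in rng.choices(range(len(probs)), weights=probs, k=n_trials):
        counts[category] += 1
    return counts


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
p = draw_dirichlet([2.0, 3.0, 5.0], rng)  # step 1: probability vector
y = draw_multinomial(20, p, rng)          # step 2: outcome counts
print(p, y)
```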

Solved – Dirichlet Prior for Multinomial

I do not think this has anything to do with a wrong definition of the Dirichlet prior or posterior. Simply, when $$(x_1,\ldots,x_k)\sim\mathcal{D}(\alpha_1,...,\alpha_k)$$ the mean is given by $$\mathbb{E}[(x_1,\ldots,x_k)]=\dfrac{(\alpha_1,...,\alpha_k)}{\sum_{i=1}^k\alpha_i}$$ which explains the discrepancy with the MLE.

> c(11.,4.,5.)/sum(c(11.,4.,5.))
[1] 0.55 0.20 0.25
> c(10.,3.,4.)/sum(c(10.,3.,4.))
[1] 0.5882353 0.1764706 0.2352941

If instead you use the mode of the Dirichlet distribution, $$(x_1^\text{mode},\ldots,x_k^\text{mode})=\dfrac{(\alpha_1-1,...,\alpha_k-1)}{\sum_{i=1}^k\alpha_i-k},$$ you recover the MLE. This makes complete sense because the MAP is the MLE under a flat prior.
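To spell out that last step, here is a short worked equation. It assumes a flat prior with every prior parameter equal to 1, counts $c_i$ with total $N=\sum_{i=1}^{k}c_i$, and posterior parameters $\alpha_i=c_i+1$; the symbols $c_i$ and $N$ are introduced here and do not appear in the answer above.

$$x_i^\text{mode}=\dfrac{\alpha_i-1}{\sum_{j=1}^{k}\alpha_j-k}=\dfrac{(c_i+1)-1}{\sum_{j=1}^{k}(c_j+1)-k}=\dfrac{c_i}{N}$$

This is the usual multinomial MLE. It would match the numbers above if the counts were $(10,3,4)$, giving posterior parameters $(11,4,5)$.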
