Solved – Dirichlet Prior for Multinomial

bayesian, conjugate-prior, dirichlet-distribution, multinomial-distribution

The Dirichlet distribution is the conjugate prior of the multinomial,
so the posterior given some observations is also Dirichlet.
Suppose, for example, that I observe the counts $X=(10,3,4)$ from 17 trials (10 for class 1, 3 for class 2 and 4 for class 3).
Assuming a non-informative prior with concentration parameter $\alpha = (1,1,1)$, the posterior Dirichlet should have concentration parameter $\alpha_{\text{post}} = \alpha + X = (11,4,5)$ (as far as I know).
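
A minimal sketch of this conjugate update in NumPy (the variable names are mine):

import numpy as np

alpha_prior = np.array([1., 1., 1.])  # non-informative prior
x = np.array([10., 3., 4.])           # observed counts
alpha_post = alpha_prior + x          # conjugate update: alpha + X
print(alpha_post)                     # [11.  4.  5.]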

If I draw many samples from a multinomial whose probability vector is drawn from the Dirichlet with this posterior $\alpha_{\text{post}}$, the resulting maximum likelihood estimate of the class probabilities differs from the MLE computed from the observations
(see code below).

I don't understand why. Is the posterior Dirichlet alpha instead given by
$$
(\alpha_0 + X_0 - 1,\ \alpha_1 + X_1 - 1,\ \alpha_2 + X_2 - 1)\,?
$$
In my simulation that seems to be the correct answer.

import numpy as np
import scipy.stats

# observations:
x = np.array([10., 3., 4.])

# posterior concentration parameters for the Dirichlet,
# assuming an uninformative prior with concentration parameters [1, 1, 1]
a = x + 1
# a = x  # uncomment to reproduce the second output block below

# simulate: draw probability vectors from the posterior Dirichlet,
# then one class label per probability vector
nb_samples = 200000
d = scipy.stats.dirichlet.rvs(a, nb_samples)
c_ = np.empty(nb_samples)
# without a loop possible?? (see the vectorized sketch after the output below)
for i in range(nb_samples):
    c_[i] = np.random.choice(3, p=d[i])

counts = np.bincount(c_.astype(int))
print("concentration parameter of the Dirichlet:", a)
print("simulation sample probabilities (mle):", counts / float(nb_samples))
print("observation mle probabilities:", x / x.sum())


>concentration parameter of the Dirichlet: [ 11.   4.   5.]
>simulation sample probabilities (mle): [ 0.550255  0.198655  0.25109 ]
>observation mle probabilities: [ 0.58823529  0.17647059  0.23529412]

>concentration parameter of the Dirichlet: [ 10.   3.   4.]
>simulation sample probabilities (mle): [ 0.58836  0.17733  0.23431]
>observation mle probabilities: [ 0.58823529  0.17647059  0.23529412]
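
As an aside, the per-sample loop above can be avoided with the inverse-CDF trick, vectorized over rows. A minimal sketch, reusing d and nb_samples from the code above:

u = np.random.rand(nb_samples, 1)        # one uniform draw per Dirichlet sample
c_ = (d.cumsum(axis=1) < u).sum(axis=1)  # index of first cumulative probability >= u
counts = np.bincount(c_)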

Best Answer

I do not think this has anything to do with a wrong definition of the Dirichlet prior or posterior: simply, when $$(x_1,\ldots,x_k)\sim\mathcal{D}(\alpha_1,\ldots,\alpha_k)$$ the mean is given by $$\mathbb{E}[(x_1,\ldots,x_k)]=\dfrac{(\alpha_1,\ldots,\alpha_k)}{\sum_{i=1}^k\alpha_i}\,,$$ which explains the discrepancy with the MLE.

> c(11., 4., 5.) / sum(c(11., 4., 5.))
[1] 0.55 0.20 0.25
> c(10., 3., 4.) / sum(c(10., 3., 4.))
[1] 0.5882353 0.1764706 0.2352941

If instead you use the mode of the Dirichlet distribution, $$(x_1^\text{mode},\ldots,x_k^\text{mode})=\dfrac{(\alpha_1-1,\ldots,\alpha_k-1)}{\sum_{i=1}^k\alpha_i-k}\,,$$ you recover the MLE. And this makes complete sense because the MAP is the MLE under a flat prior.
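
A quick numerical check of both formulas (a sketch in Python; the variable names are mine):

import numpy as np

alpha_post = np.array([11., 4., 5.])
k = len(alpha_post)

mean = alpha_post / alpha_post.sum()              # posterior mean: (11, 4, 5) / 20
mode = (alpha_post - 1) / (alpha_post.sum() - k)  # posterior mode: (10, 3, 4) / 17

print(mean)  # [0.55 0.2  0.25]
print(mode)  # [0.58823529 0.17647059 0.23529412], i.e. the MLE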