Solved – Posterior distribution for multinomial parameter

Tags: bayesian, estimation, kernel-smoothing, probability, sampling

(topic moved from maths.stackexchange.com)

I'm currently developing an application integrating a probabilistic inference engine for Bayesian Networks. The Bayesian Network integrates some form of "model uncertainty", where the conditional probability distributions (CPDs) of specific variables in the network have uncertain parameters which should be estimated from data. The distributions with these uncertain parameters are all defined as multinomials on a small set of alternative events.

I would like to use a Bayesian approach to estimate these parameters, starting from an initial prior, and progressively narrowing down their spread given the data recorded by the application.

Unfortunately, the data is a bit difficult to deal with, since it consists mostly of "soft evidence", so the parameter estimation doesn't seem to have an easy analytic solution such as a direct update of Dirichlet counts. I therefore thought it would be more appropriate to use sampling methods (starting with a simple importance sampling algorithm). The idea is simply to perform sampling to compute, for each parameter set $\boldsymbol{\theta}$ corresponding to an uncertain multinomial, the posterior distribution $P(\boldsymbol{\theta}|evidence)$, and to update the parameter distribution accordingly. Note that I need to infer the full posterior probability distribution and not simply some point estimates, since I want to keep the "model uncertainty" as part of my probabilistic model.
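For concreteness, here is a minimal sketch (in Python/NumPy) of the importance sampler I have in mind, for a binary cherry/lime case like the example further down. The uniform Dirichlet prior, the particular soft-evidence values and all variable names are placeholders of my own, not part of the actual application:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_likelihood(thetas, obs):
    """P(o | theta) = sum_k P(o | F=k) P(F=k | theta), vectorised over samples."""
    return thetas @ obs

# Dirichlet prior over the (binary) multinomial parameter vector theta.
prior_alpha = np.array([1.0, 1.0])
n_samples = 100_000

# Propose from the prior: the simplest importance proposal, so the prior
# cancels and the weights reduce to the likelihood of the soft evidence.
thetas = rng.dirichlet(prior_alpha, size=n_samples)   # shape (n_samples, 2)

# Soft evidence: row i gives P(o_i | F=k) for the two flavors.
observations = np.array([[0.9, 0.1],
                         [0.9, 0.1],
                         [0.3, 0.7]])

log_w = np.zeros(n_samples)
for obs in observations:
    log_w += np.log(soft_likelihood(thetas, obs))
w = np.exp(log_w - log_w.max())
w /= w.sum()

# (thetas, w) is a weighted-sample representation of P(theta | evidence);
# e.g. the posterior mean:
print(w @ thetas)
```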

My question is quite simple: once I have collected the samples from this posterior distribution, how do I use these statistics to extract a full probability distribution over the possible parameters? Since I'm dealing with multinomials, the parameters are themselves defined as a vector, so the distribution should be a multivariate, continuous distribution, with the additional constraint that the parameters must obey the usual axioms of probability theory.

I was thinking of using multivariate Kernel Density Estimation (KDE) to extract this parameter distribution out of my samples, but I don't know whether that's a good idea, both in terms of mathematical correctness and in terms of efficiency. For instance, would the resulting distribution still satisfy the probability axioms (e.g. probabilities summing up to 1.0)?
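For illustration, here is the kind of KDE extraction I had in mind, again only a sketch: since a plain Gaussian KDE in $\theta$-space would put mass outside the probability simplex, I map the samples to an unconstrained space with an additive log-ratio transform first. The transform, and the Jacobian correction it would require to yield a proper density on the simplex, are my own assumptions rather than an established recipe:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Stand-in for the weighted posterior samples (thetas, w) produced above.
thetas = rng.dirichlet([3.0, 2.0], size=5_000)   # (n, K) points on the simplex
w = np.full(len(thetas), 1.0 / len(thetas))      # uniform weights as a placeholder

def alr(theta, eps=1e-12):
    """Additive log-ratio transform: simplex -> R^(K-1)."""
    theta = np.clip(theta, eps, 1.0)
    return np.log(theta[:, :-1] / theta[:, -1:])

# Run the KDE in the unconstrained space; scipy expects shape (dims, n).
kde = gaussian_kde(alr(thetas).T, weights=w)

# Density of a candidate parameter vector. Caveat: to obtain a proper
# density on the simplex, this value must still be multiplied by the
# Jacobian of the transform.
candidate = np.array([[0.7, 0.3]])
print(kde(alr(candidate).T))
```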

It seems to me that I'm dealing with quite a standard problem (estimating the parameter distribution of a multinomial via sampling), but I haven't found an answer so far. What's your opinion? Should I use KDE, another method, or even forget about the whole idea of re-estimating the density function and work directly with the samples?


@Reply to Neil G: The problem is that, with uncertain evidence, the posterior distribution over the parameters might not have the same form as the prior. With hard evidence, we know for instance that if the prior parameter distribution is a Dirichlet (the conjugate prior of the multinomial), the posterior will also be a Dirichlet.

But this doesn't seem to hold anymore if we handle uncertain evidence, at least according to my calculations (please correct me if my reasoning is wrong). I'll take a simple example inspired by the "cherry/lime flavor" example from Russell & Norvig (chapter 20).

Let's assume we want to know the distribution of candies with lime or cherry flavors in a bag. In a Bayesian approach, we can add a parameter node $\theta$ describing this distribution:

[Figure: Bayesian network with parameter node $\theta$ and flavor node $D_1$]

Assuming the parameter $\theta$ follows a Beta distribution $\text{beta[a,b]}(\theta)$, we can easily calculate the posterior given the evidence provided by a data point, say that a particular candy is $cherry$:

$P(\theta|D_1=cherry) = \alpha \ P(D_1=cherry|\theta) \ P(\theta)$
$P(\theta|D_1=cherry) = \alpha' \ \theta \ \text{beta[a,b]}(\theta) = \alpha' \ \theta^{a} \ (1-\theta)^{b-1}$
$P(\theta|D_1=cherry) = \text{beta[a+1,b]}(\theta)$

So here we see that the posterior is still a Beta distribution (the same reasoning would hold if the distribution had been a Dirichlet, of course).
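This conjugate update is easy to verify numerically; here is a small sanity check (a sketch, with arbitrary hyperparameters $a=3$, $b=2$):

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

a, b = 3.0, 2.0  # arbitrary prior hyperparameters

# Unnormalised posterior after observing one cherry: theta * beta[a,b](theta).
unnorm = lambda t: t * beta.pdf(t, a, b)
Z, _ = quad(unnorm, 0.0, 1.0)

# Compare against beta[a+1, b] on a grid: the two should match pointwise.
grid = np.linspace(0.01, 0.99, 5)
print(np.allclose(unnorm(grid) / Z, beta.pdf(grid, a + 1, b)))  # True
```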

Now, imagine that instead of the "hard" evidence of a particular candy being cherry or lime, we only had soft evidence, for instance that a particular candy is cherry with $p=0.9$ and lime with $p=0.1$. Graphically, this would be represented as:

[Figure: the same network with an observation node $o$ added as a child of the flavor node $F$]

where we have the evidence $o=true$. If we now calculate the posterior distribution:

$P(\theta|o=true) = \alpha \ P(\theta) \ \sum_{F} P(o=true | F) \ P(F | \theta)$
$P(\theta|o=true) = \alpha \ \text{beta[a,b]}(\theta) \ \left[0.9 \theta + 0.1 (1-\theta) \right]$
$P(\theta|o=true) = \alpha \ \text{beta[a,b]}(\theta) \ \left[0.8 \theta + 0.1 \right]$

And the problem here is that, as far as I can see, the posterior distribution is no longer a Beta distribution, but rather a linear combination of Beta distributions. Updating the hyperparameters $a$ and $b$ with weighted counts (as you suggested) would give the wrong result here, since $\text{beta[a,b]}(\theta) \ \left[0.8 \theta + 0.1 \right]$ quite obviously does not reduce to $\text{beta[a+0.8,b+0.1]}(\theta)$.
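Here is a small numerical check of this claim (a sketch, with arbitrary hyperparameters). Using $\theta \, \text{beta[a,b]}(\theta) = \frac{a}{a+b} \, \text{beta[a+1,b]}(\theta)$, the posterior is exactly a two-component mixture of Betas, while the "weighted count" Beta does not match it pointwise:

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

a, b = 3.0, 2.0  # arbitrary prior hyperparameters

# Unnormalised posterior: beta[a,b](theta) * (0.8*theta + 0.1).
post = lambda t: beta.pdf(t, a, b) * (0.8 * t + 0.1)
Z, _ = quad(post, 0.0, 1.0)

# The posterior IS a two-component mixture of Betas:
# 0.8 * theta * beta[a,b] = 0.8 * (a/(a+b)) * beta[a+1,b]; 0.1 * beta[a,b] stays put.
w1, w2 = 0.8 * a / (a + b), 0.1
mix = lambda t: (w1 * beta.pdf(t, a + 1, b) + w2 * beta.pdf(t, a, b)) / (w1 + w2)

grid = np.linspace(0.01, 0.99, 9)
print(np.allclose(post(grid) / Z, mix(grid)))                         # True
# ...but no single Beta does: the "weighted count" update disagrees.
print(np.allclose(post(grid) / Z, beta.pdf(grid, a + 0.8, b + 0.1)))  # False
```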

So as I see it, the posterior distribution after observing the uncertain evidence does not follow a standard parametric form. That's why I think that updating counts wouldn't work (even if they're weighted). But I may have missed something?

Best Answer

Unfortunately, the data is a bit difficult to deal with, since it consists mostly of "soft evidence", so the parameter estimation doesn't seem to have an easy analytic solution such as a direct update of Dirichlet counts.

— why does that follow? Why not just scale the counts based on the amount of evidence supporting each one?

If you have samples of the posterior distribution, why can't you use the sufficient statistics to turn those samples into a maximum likelihood distribution?
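For instance, here is a sketch of one simple variant, using method of moments rather than a full maximum-likelihood fit (the sampling distribution below is just a stand-in for your weighted posterior samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_beta_by_moments(samples, weights):
    """Moment-match a Beta(a, b) to weighted samples of theta in (0, 1)."""
    m = np.average(samples, weights=weights)
    v = np.average((samples - m) ** 2, weights=weights)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Stand-in for weighted posterior samples of a binomial parameter theta.
samples = rng.beta(4.0, 2.0, size=10_000)
weights = np.full_like(samples, 1.0 / len(samples))
print(fit_beta_by_moments(samples, weights))   # roughly (4, 2)
```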


Update after your recent edits:

The likelihood on $\theta$ given that $o$ is true has density

\begin{align} T_1(x) \propto 0.9 x + 0.1(1-x), \end{align}

and given that $o$ is false it has density, say,

\begin{align} T_2(x) \propto 0.2 x + 0.8(1-x). \end{align}

Then, the final density after $\eta_1$ observations of true and $\eta_2$ observations of false is proportional to

\begin{align} T_1(x)^{\eta_1}T_2(x)^{\eta_2} \end{align}

which is nevertheless an exponential family, although not a Beta distribution, as you rightly point out.
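As an illustration, that density is straightforward to handle numerically; here is a sketch with an assumed Beta prior and assumed observation counts (both hypothetical):

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

# Soft-evidence factors from above (T2's coefficients are assumed values).
T1 = lambda x: 0.9 * x + 0.1 * (1 - x)
T2 = lambda x: 0.2 * x + 0.8 * (1 - x)

a, b = 1.0, 1.0      # assumed prior hyperparameters
eta1, eta2 = 5, 3    # assumed numbers of "true" / "false" soft observations

# Unnormalised posterior: prior times the product of soft likelihoods.
unnorm = lambda x: beta.pdf(x, a, b) * T1(x) ** eta1 * T2(x) ** eta2
Z, _ = quad(unnorm, 0.0, 1.0)

# Posterior mean, obtained by quadrature rather than conjugate updates.
mean, _ = quad(lambda x: x * unnorm(x) / Z, 0.0, 1.0)
print(mean)
```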
