Solved – Using empirical priors in PyMC

bayesian, markov-chain-monte-carlo, posterior, prior, pymc

I'm using PyMC to sample from a posterior distribution, and I've run into a roadblock: my priors come from samples, not from models. My situation is as follows:

  • I have some empirical data for a parameter $z$ from which I compute a probability distribution $p(z)$. There is no known model/parametrization for the distribution of $z$. All I have are empirical values. It is known that $z$ is bounded between 0 and 30.
  • I have some new observations $x$ and I compute the likelihood $p(x|z)$, also empirically.
  • I wish to find the posterior distribution updated with the new data above. (This posterior becomes the new prior for the next set of observations. Rinse and repeat.)

For example, consider this:

import numpy as np
import matplotlib.pyplot as plt

# Ignore the fact that I'm using a mixture model. For all practical
# purposes, I do not know how this is generated.
old_data = np.concatenate([3 * np.random.randn(1000) + 20,
                           3 * np.random.randn(1000) + 5,
                           4 * np.random.randn(1000) + 10])

new_data = np.concatenate([4 * np.random.randn(50) + 8,
                           2 * np.random.randn(50) + 17])

plt.hist(old_data[(old_data >= 0) & (old_data <= 30)],
         bins=range(0, 31), density=True, alpha=0.5)
plt.hist(new_data[(new_data >= 0) & (new_data <= 30)],
         bins=range(0, 31), density=True, alpha=0.5)
plt.legend(['p(z)', 'p(x|z)'])
plt.show()

Coming to PyMC, all the existing sources I've found (for instance, this chapter from Bayesian Methods for Hackers) use a normal distribution for the observations (observations = pm.Normal("obs", center_i, tau_i, value=data, observed=True)), with uniform and normal prior distributions for the precisions and means respectively.

I'm not sure how to tell PyMC that my prior is this empirical distribution rather than a parametric model. I've also tried using the @stochastic decorator with observed=True, but I don't think I fully understand it. Besides, I still can't figure out a way to avoid specifying a model.
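For concreteness, here is a rough sketch of the kind of construction I mean (assuming PyMC 2.x; hist is a density histogram of old_data over unit-wide bins on [0, 30]):

import numpy as np
import pymc as pm

# A density histogram of the old samples over 30 unit-wide bins on [0, 30].
hist, edges = np.histogram(old_data, bins=30, range=(0, 30), density=True)

@pm.stochastic(dtype=float)
def z(value=15.0):
    """Log-probability of z under the empirical histogram."""
    if not (0 <= value < 30):
        return -np.inf
    p = hist[int(value)]  # density of the bin containing `value`
    return np.log(p) if p > 0 else -np.inf

But I can't tell whether this is a sensible way to express an empirical prior.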

Am I fundamentally misunderstanding the purpose of an MCMC library? If so, how should I proceed?

All I really want is to update my prior belief with the new observations, but I don't think the solution is as simple as multiplying the two histograms (and normalizing).

Best Answer

If you already have a prior $p(\theta)$ and a likelihood $p(x|\theta)$, then you can easily find the posterior $p(\theta|x)$ by multiplying these and normalizing: $$p(\theta|x)=\frac{p(\theta)p(x|\theta)}{p(x)}\propto p(\theta)p(x|\theta)$$

https://en.wikipedia.org/wiki/Posterior_probability

The following code demonstrates estimating the posterior as a histogram, so that it can serve as the prior for the next batch:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Using Beta distributions instead of Normals to get finite support.
support_size = 30
old_data = np.concatenate([np.random.beta(70, 20, 1000),
                           np.random.beta(10, 40, 1000),
                           np.random.beta(80, 80, 1000)]) * support_size
new_data = np.concatenate([np.random.beta(20, 10, 1000),
                           np.random.beta(10, 20, 1000)]) * support_size

# Convert samples to histograms over 30 unit-wide bins on [0, 30].
support = np.arange(support_size + 1)
old_hist = np.histogram(old_data, bins=support, density=True)[0]
new_hist = np.histogram(new_data, bins=support, density=True)[0]

# Obtain smooth density estimators from the samples.
soft_old = gaussian_kde(old_data, bw_method=0.1)
soft_new = gaussian_kde(new_data, bw_method=0.1)

# Posterior histogram (to be used as the prior for the next batch):
# multiply prior by likelihood pointwise, then normalize.
post_hist = old_hist * new_hist
post_hist /= post_hist.sum()

# Smooth posterior: the product of the KDEs, normalized so that it
# integrates to one over [0, support_size] (dx = support_size / x.size).
def posterior(x):
    prod = soft_old(x) * soft_new(x)
    return prod / prod.sum() * x.size / support_size

x = np.linspace(0, support_size, 100)

plt.bar(support[:-1], old_hist, alpha=0.5, label='p(z)', color='b')
plt.bar(support[:-1], new_hist, alpha=0.5, label='p(x|z)', color='g')
plt.plot(x, soft_old(x), label='p(z) smoothed', lw=2)
plt.plot(x, soft_new(x), label='p(x|z) smoothed', lw=2)
plt.legend(loc='best', fontsize='small')
plt.show()

plt.bar(support[:-1], post_hist, alpha=0.5, label='p(z|x)', color='r')
plt.plot(x, soft_old(x), label='p(z) smoothed', lw=2)
plt.plot(x, soft_new(x), label='p(x|z) smoothed', lw=2)
plt.plot(x, posterior(x), label='p(z|x) smoothed', lw=2)
plt.legend(loc='best', fontsize='small')
plt.show()

[Figures: the prior p(z) and likelihood p(x|z) histograms with their smoothed KDEs, and the posterior p(z|x) histogram with all three smoothed curves.]
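Since the question asks to "rinse and repeat": the posterior histogram simply takes the place of the prior for the next batch. A minimal sketch, assuming a hypothetical next batch next_data on the same support:

# Hypothetical next batch of observations on the same [0, 30] support.
next_data = np.random.beta(15, 15, 500) * support_size

prior_hist = post_hist  # the previous posterior becomes the new prior
like_hist = np.histogram(next_data, bins=support, density=True)[0]

# Multiply and renormalize, exactly as before.
post_hist = prior_hist * like_hist
post_hist /= post_hist.sum()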

If, however, you want to combine your empirical prior with a model sampled by MCMC, I suggest you take a look at PyMC's Potential; one of its main applications is "soft data". Please update your question if you need an answer targeted towards that.
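For completeness, a minimal sketch of that route, assuming PyMC 2.x; mu is a hypothetical parameter of interest, and soft_old is the KDE built above:

import pymc as pm

mu = pm.Uniform('mu', 0, support_size)  # hypothetical parameter of interest

@pm.potential
def empirical_prior(mu=mu):
    """Add the empirical log-density of mu as a soft constraint."""
    d = soft_old(mu)[0]  # gaussian_kde returns an array; take the scalar
    return np.log(d) if d > 0 else -np.inf

The potential is then included alongside the rest of the model when you build pm.MCMC, which tilts the sampler towards the empirical distribution without requiring a parametric form for it.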
