The problem
I want fit the model parameters of a simple 2-Gaussian mixture population. Given all the hype around Bayesian methods I want to understand if for this problem Bayesian inference is a better tool that traditional fitting methods.
So far MCMC performs very poorly in this toy example, but maybe I just overlooked something. So let see the code.
The tools
I will use python (2.7) + scipy stack, lmfit 0.8 and PyMC 2.3.
A notebook to reproduce the analysis can be found here
Generate the data
First let generate the data:
from scipy.stats import distributions
# Sample parameters
nsamples = 1000
mu1_true = 0.3
mu2_true = 0.55
sig1_true = 0.08
sig2_true = 0.12
a_true = 0.4
# Samples generation
np.random.seed(3) # for repeatability
s1 = distributions.norm.rvs(mu1_true, sig1_true, size=round(a_true*nsamples))
s2 = distributions.norm.rvs(mu2_true, sig2_true, size=round((1-a_true)*nsamples))
samples = np.hstack([s1, s2])
The histogram of samples
looks like this:
a "broad peak", the components are hard to spot by eye.
Classical approach: fit the histogram
Let's try the classical approach first. Using lmfit it's easy to define a 2-peaks model:
import lmfit
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
model.set_param_hint('p1_center', value=0.2, min=-1, max=2)
model.set_param_hint('p2_center', value=0.5, min=-1, max=2)
model.set_param_hint('p1_sigma', value=0.1, min=0.01, max=0.3)
model.set_param_hint('p2_sigma', value=0.1, min=0.01, max=0.3)
model.set_param_hint('p1_amplitude', value=1, min=0.0, max=1)
model.set_param_hint('p2_amplitude', expr='1 - p1_amplitude')
name = '2-gaussians'
Finally we fit the model with the simplex algorithm:
fit_res = model.fit(data, x=x_data, method='nelder')
print fit_res.fit_report()
The result is the following image (red dashed lines are fitted centers):
Even if the problem is kind of hard, with proper initial values and constraints the models converged to quite a reasonable estimate.
Bayesian approach: MCMC
I define the model in PyMC in hierarchical fashion. centers
and sigmas
are the priors
distribution for the hyperparameters representing the 2 centers and 2 sigmas of the 2 Gaussians. alpha
is the fraction of the first population and the prior distribution is here a Beta.
A categorical variable chooses between the two populations. It is my understanding that this variable needs to be same size as the data (samples
).
Finally mu
and tau
are deterministic variables that determine the parameters of the Normal distribution (they depend on the category
variable so they randomly switch between the two values for the two populations).
sigmas = pm.Normal('sigmas', mu=0.1, tau=1000, size=2)
centers = pm.Normal('centers', [0.3, 0.7], [1/(0.1)**2, 1/(0.1)**2], size=2)
#centers = pm.Uniform('centers', 0, 1, size=2)
alpha = pm.Beta('alpha', alpha=2, beta=3)
category = pm.Categorical("category", [alpha, 1 - alpha], size=nsamples)
@pm.deterministic
def mu(category=category, centers=centers):
return centers[category]
@pm.deterministic
def tau(category=category, sigmas=sigmas):
return 1/(sigmas[category]**2)
observations = pm.Normal('samples_model', mu=mu, tau=tau, value=samples, observed=True)
model = pm.Model([observations, mu, tau, category, alpha, sigmas, centers])
Then I run the MCMC with quite long number of iterations (1e5, ~60s on my machine):
mcmc = pm.MCMC(model)
mcmc.sample(100000, 30000)
However the results are very odd. For example $\alpha$ trace (the fraction of the first population) tends to 0 instead to converge to 0.4 and has a very strong autocorrelation:
Also the centers of the Gaussians do not converge as well. For example:
As you see in the prior choice, I tried to "help" the MCMC algorithm using a Beta distribution for the prior population fraction $\alpha$. Also the prior distributions for the centers and sigmas are quite reasonable (I think).
So what's going on here? Am I doing something wrong or MCMC is not suitable for this problem?
I understand that MCMC method will be slower, but the trivial histogram fit seems to perform immensely better in resolving the populations.
Best Answer
The problem is caused by the way that PyMC draws samples for this model. As explained in section 5.8.1 of the PyMC documentation, all elements of an array variable are updated together. For small arrays like
center
this is not a problem, but for a large array likecategory
it leads to a low acceptance rate. You can see the acceptance rate viaThe solution suggested in the documentation is to use an array of scalar-valued variables. In addition, you need to configure some of the proposal distributions since the default choices are bad. Here is the code that works for me:
The
proposal_sd
andproposal_distribution
options are explained in section 5.7.1. For the centers, I set the proposal to roughly match the standard deviation of the posterior, which is much smaller than the default due to the amount of data. PyMC does attempt to tune the width of the proposal, but this only works if your acceptance rate is sufficiently high to begin with. Forcategory
, the defaultproposal_distribution = 'Poisson'
which gives poor results (I don't know why this is, but it certainly doesn't sound like a sensible proposal for a binary variable).