Solved – Bayesian model selection in PyMC3

bayesian | model-selection | pymc

I am using PyMC3 to run Bayesian models on my data.

I am new to Bayesian modeling, but according to some blog posts, Wikipedia, and Q&A from this site, it seems to be a valid approach to use the Bayes factor and the BIC criterion to choose which model best represents my data (i.e. the one most likely to have generated my data).

To compute the Bayes factor, I need the marginal likelihood of each model I want to compare. It is still a bit confusing to me, but I think there are two ways to get the likelihood (correct me if I am wrong):

  • the algebraic way, when the models are simple: see the example on the Wikipedia Bayes factor page

  • the numeric way: this is what PyMC3 does with its MCMC algorithms
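For the algebraic route, the coin-flip example from the Wikipedia Bayes factor page (115 heads in 200 flips; a fair coin versus a uniform prior on the bias) can be worked out in a few lines of plain Python. This is just that textbook example, not PyMC3 output:

```python
from math import comb

# s = 115 heads observed in n = 200 flips.
# M1: the coin is fair (q = 1/2).
# M2: q is unknown with a uniform prior on [0, 1].
n, s = 200, 115

# Marginal likelihood under M1: binomial probability at q = 1/2.
p_m1 = comb(n, s) * 0.5**n

# Marginal likelihood under M2: integrating the binomial likelihood
# against the uniform prior gives exactly 1 / (n + 1).
p_m2 = 1 / (n + 1)

bayes_factor = p_m1 / p_m2
print(p_m1, p_m2, bayes_factor)
```

The Bayes factor comes out slightly above 1, i.e. the data mildly favor the fair-coin model even though 115/200 looks biased, because M2 spreads its prior mass over many values of q that fit the data poorly.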

How can I access the likelihood and thus compare my models in PyMC3? I found the model.logp method, which according to the docs is the "log probability density function". Can I use that to get the likelihood?

Bonus question: when two models are compared, the ratio between their likelihoods is computed. What happens if you want to compare several models?

A concrete PyMC3 example would be very helpful!

Best Answer

You can indeed compute the likelihood of a model using model.logp(). As input, it requires a point. For example, for the BEST model from the examples directory I can do:

np.exp(model.logp({'group1_mean': 0.1, 
                   'group2_mean': 0.2, 
                   'group1_std_interval': 1., 
                   'group2_std_interval': 1.2, 
                   'nu_minus_one_log': 1}))

Note that this model uses transformed variables, so I have to supply values for those. model.logp() returns the log density, so taking the exp() gives the density itself, which you could then use inside a numerical integrator, for example one provided by scipy.integrate. The problem is that even with only 5 parameters, this will be very slow.
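To make the quadrature idea concrete, here is a sketch with a toy one-parameter model in plain Python. The density() function is a stand-in for np.exp(model.logp(point)); the toy model is conjugate, so the quadrature result can be checked against the closed-form marginal likelihood:

```python
import math

# Toy model: data y_i ~ N(mu, 1) with prior mu ~ N(0, 1).
# density(mu) plays the role of np.exp(model.logp(point)).
y = [0.3, -0.2, 0.7, 1.1, 0.1]

def density(mu):
    """Joint density p(y, mu) = prod_i N(y_i | mu, 1) * N(mu | 0, 1)."""
    log_p = -0.5 * mu**2 - 0.5 * math.log(2 * math.pi)              # prior
    for yi in y:
        log_p += -0.5 * (yi - mu)**2 - 0.5 * math.log(2 * math.pi)  # likelihood
    return math.exp(log_p)

# Marginal likelihood p(y) by trapezoidal quadrature over mu.
lo, hi, steps = -10.0, 10.0, 20000
h = (hi - lo) / steps
marginal = sum(density(lo + i * h) for i in range(steps + 1)) * h \
    - 0.5 * h * (density(lo) + density(hi))

# For this conjugate model the marginal likelihood is available in
# closed form, which lets us check the quadrature:
n, S, SS = len(y), sum(y), sum(yi**2 for yi in y)
exact = (2 * math.pi) ** (-n / 2) / math.sqrt(n + 1) \
    * math.exp(-0.5 * (SS - S**2 / (n + 1)))
print(marginal, exact)
```

With one parameter this is instant; the cost of a grid like this grows exponentially with the number of parameters, which is exactly why the approach breaks down even at 5 dimensions.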

Bayes factors are generally very difficult to compute because you have to integrate over the complete parameter space. There are some ideas for using MCMC samples for that; see this post, and especially its comment section, for more information: https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/ The case for BIC is unfortunately similar.

If you really want to compute the Bayes factor, you can also look at the Savage–Dickey density ratio (see e.g. http://drsmorey.org/bibtex/upload/Wagenmakers:etal:2010.pdf), but its application is limited: it only works for nested model comparisons.
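As a sketch of the Savage–Dickey idea for a nested comparison where the null model fixes theta = 0: the Bayes factor in favor of the null equals the posterior density at that value divided by the prior density there. The "posterior samples" below are simulated directly rather than drawn with PyMC3, just to keep the example self-contained:

```python
import math
import random

random.seed(0)

# Savage-Dickey: for nested models, BF01 = p(theta0 | data) / p(theta0),
# the posterior and prior densities evaluated at the null value.
theta0 = 0.0

# Prior: theta ~ N(0, 1).
def prior_pdf(x):
    return math.exp(-0.5 * x**2) / math.sqrt(2 * math.pi)

# Pretend these are MCMC draws from the posterior (simulated here as
# N(0.5, 0.2) purely for the sake of a runnable example).
samples = [random.gauss(0.5, 0.2) for _ in range(20000)]

# Estimate the posterior density at theta0 with a Gaussian kernel.
bw = 0.05
post_at_0 = sum(
    math.exp(-0.5 * ((theta0 - s) / bw) ** 2) for s in samples
) / (len(samples) * bw * math.sqrt(2 * math.pi))

bf01 = post_at_0 / prior_pdf(theta0)
print(bf01)
```

Here the posterior has moved away from zero, so BF01 comes out well below 1, i.e. evidence against the null. Note the result is sensitive to the density estimate at the point, which is one practical weakness of the method.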

I suppose that you're trying to do model comparison, which is a field with many opinions and solutions (some hard to implement, like Bayes factors). One measure that is very easy to compute is the Deviance Information Criterion (DIC). It has its downsides, although some of them can be remedied (see http://onlinelibrary.wiley.com/doi/10.1111/rssb.12062/abstract). Unfortunately we haven't ported the code to pymc3 yet, but it'd be pretty easy (see here for the pymc2 implementation: https://github.com/pymc-devs/pymc/blob/895c24f62b9f5d786bce7ac4fe88edb4ad220364/pymc/MCMC.py#L410).
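The standard DIC formula is simple enough to sketch in plain Python from a trace of posterior samples. The trace below is simulated directly (not a PyMC3 trace), and log_lik is a toy stand-in for the model's log-likelihood:

```python
import math
import random

random.seed(1)

# Toy setup: data y_i ~ N(mu, 1); log-likelihood as a function of mu.
y = [0.3, -0.2, 0.7, 1.1, 0.1]

def log_lik(mu):
    return sum(-0.5 * (yi - mu)**2 - 0.5 * math.log(2 * math.pi)
               for yi in y)

# Pretend these are posterior draws of mu from an MCMC trace
# (simulated directly so the example runs without PyMC3).
trace = [random.gauss(0.33, 0.41) for _ in range(20000)]

# DIC = mean deviance + effective number of parameters pD, where
# D(mu) = -2 * log_lik(mu) and pD = mean(D) - D(posterior mean).
deviances = [-2 * log_lik(mu) for mu in trace]
mean_dev = sum(deviances) / len(deviances)
mu_bar = sum(trace) / len(trace)
p_d = mean_dev - (-2 * log_lik(mu_bar))
dic = mean_dev + p_d
print(dic, p_d)
```

For this one-parameter toy model pD comes out close to 1, as expected; lower DIC is better when comparing models fit to the same data.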

Kruschke favors the approach of just building the full model and letting it tell you which parameters matter. You could also build variable selection into the model itself (see e.g. http://arxiv.org/pdf/math/0505633.pdf).

Finally, for a much more complete treatment, see this recent blog post: http://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/