Solved – pymc3: acceptance probabilities and divergences after tuning

Tags: logistic, pymc, regression

I coded two models in pymc3 that I thought were quite simple.

Logistic Regression

The first is a logistic regression in an experiment that models correct and wrong answers for specific tasks in a 2×2 factorial design. Here's what it looks like:

import pymc3 as pm

with pm.Model() as m1:
    b_spe = pm.Normal('b_spe', 0, 5)
    b_noi = pm.Normal('b_noi', 0, 5)
    b_sn = pm.Normal('b_sn', 0, 5)
    sigma_sub = pm.HalfCauchy('sigma_sub', 1)
    sigma_run = pm.HalfCauchy('sigma_run', 1)

    a = pm.Normal('a', 0, 5)

    a_sub = pm.Normal('a_sub', 0, sigma_sub, shape=len(d['sub'].unique()))
    a_run = pm.Normal('a_run', 0, sigma_run, shape=len(d['run'].unique()))    

    p = pm.invlogit(a + a_sub[d["sub"].values] + a_run[d["run"].values] + b_spe*d.speech 
                    + b_noi * d.noise + b_sn * d.speech * d.noise)
    hits = pm.Binomial('hits', p=p, n=d.total, observed=d.hit)

    trace_m1 = pm.sample(5000, tune=5000, njobs=4, target_accept=.99)

After sampling I get the following warning:

Sampling 4 chains: 100%|██████████| 40000/40000 [03:38<00:00, 37.86draws/s]
The acceptance probability does not match the target. It is 0.8942185942514685, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 25% for some parameters.

Here are the traceplots:
[Figure: traceplots of the logistic regression]

Rhat(a) = 1.0002822268704348

The traceplots look fine, at least to me, and r-hat is very close to 1. Is there a problem with the model? What can I do to fix it?
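For context on what that r-hat value means: it compares the between-chain and within-chain variance, and values near 1 indicate the chains agree. A minimal numpy sketch of the classic (non-split) Gelman–Rubin statistic, for illustration only — pymc3 computes a more refined version internally:

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non-split) R-hat for one scalar parameter.

    chains: array of shape (n_chains, n_draws)
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    # between-chain variance, scaled by the number of draws
    B = n * chain_means.var(ddof=1)
    # average within-chain variance
    W = chains.var(axis=1, ddof=1).mean()
    # pooled posterior-variance estimate
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
well_mixed = rng.normal(size=(4, 5000))
print(gelman_rubin(well_mixed))  # close to 1.0 for well-mixed chains
```

Chains stuck around different values drive B (and hence r-hat) up, which is why r-hat close to 1 is necessary, though not sufficient, for trusting the traces.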

Linear Regression

This model is related to the experiment above, but here I use fMRI data, which I model with Student's t-distributions to account for the longer tails. It's still a 2×2 factorial design, and I also model the covariance between each subject's intercepts and slopes using an LKJ prior on the Cholesky factor. Furthermore, the model uses a non-centered parameterization, inspired by the models of chapter 13.23 from Statistical Rethinking.
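For readers unfamiliar with the non-centered trick: instead of drawing a group-level effect directly as `Normal(0, sigma)`, you draw a standard-normal `z` and scale it by `sigma`, which breaks the funnel-shaped dependence between the scale and the effects that NUTS struggles with. A quick numpy check (illustrative only, not part of the model below) that the two parameterizations describe the same distribution:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_sub = 1.5
n = 200_000

# centered: draw the subject effects directly from Normal(0, sigma_sub)
centered = rng.normal(0.0, sigma_sub, size=n)

# non-centered: draw standard-normal z, then scale by sigma_sub
z = rng.normal(0.0, 1.0, size=n)
non_centered = sigma_sub * z

# both should have (approximately) the same mean and standard deviation
print(centered.std(), non_centered.std())  # both close to 1.5
```

The distributions are identical, but the non-centered geometry is much easier for the sampler when `sigma_sub` is small.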

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as m2:
    sd_dist = pm.HalfCauchy.dist(beta=2)
    pchol_sub = pm.LKJCholeskyCov('pchol_sub', eta=4, n=n_dim, sd_dist=sd_dist)
    chol = pm.expand_packed_triangular(n_dim, pchol_sub, lower=True)    

    a = pm.Normal('a', 0, 1)

    sigma_sub = pm.HalfCauchy('sigma_sub', 1)
    a_sub = pm.Normal('a_sub', 0, sigma_sub, shape=n_subs)  

    # includes beta for spe, noi, sn
    b_spe = pm.StudentT('b_spe', nu=3, mu=0, lam=1)
    b_noi = pm.StudentT('b_noi', nu=3, mu=0, lam=1)
    b_sn = pm.StudentT('b_sn', nu=3, mu=0, lam=1)

    # per subject
    b_s_sub = tt.dot(chol, pm.StudentT('b_s_sub', nu=3, mu=0, lam=1, shape=(1, n_subs)))
    b_n_sub = tt.dot(chol, pm.StudentT('b_n_sub', nu=3, mu=0, lam=1, shape=(1, n_subs)))
    b_sn_sub = tt.dot(chol, pm.StudentT('b_sn_sub', nu=3, mu=0, lam=1, shape=(1, n_subs)))

    A = a + a_sub[d['sub'].values]
    B_s = b_spe + b_s_sub[0, d['sub'].values]
    B_n = b_noi + b_n_sub[0, d['sub'].values]
    B_sn = b_sn + b_sn_sub[0, d['sub'].values]

    sigma = pm.HalfCauchy('sigma', 2)
    lam = sigma ** -2
    nu = pm.Exponential('ν_minus_one', 1/29.) + 1
    mu = pm.Deterministic('mu', A + B_s * d.speech + B_n * d.noise + B_sn * d.speech * d.noise)
    psc = pm.StudentT('psc', nu=nu, mu=mu, lam=lam, observed=d.psc_c)

    trace_m2 = pm.sample(5000, tune=5000, njobs=4, target_accept=0.99)

Here are the warnings/errors I received:

Sampling 4 chains: 100%|██████████| 40000/40000 [04:55<00:00, 135.45draws/s]
There were 2 divergences after tuning. Increase target_accept or reparameterize.
ERROR:pymc3:There were 2 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.7183184493954338, but should be close to 0.8. Try to increase the number of tuning steps.

There are other repeated warnings that say pretty much the same thing; I get them for each chain.

Below are the trace plots.

[Figure: traceplots of the linear regression]

If I zoom in on draws 2000–3000, the traces look funky:

[Figure: zoomed-in traceplots]

Here's an example pair plot of sigma_sub against a_sub_7; the red dots are the divergent draws. I was inspired by the post by Thomas Wiecki.

[Figure: pair plot for the linear regression]

Rhat(a) = 1.0014072110125143

Again, the traces look fine to me (except for the zoomed-in part) and r-hat is very close to 1. What am I doing wrong?

The data for this model is quite noisy, and with only 17 subjects the sample size is not huge.

Would really appreciate some help here 🙂
Thanks!

Edit:
Using pm.sample(..., nuts_kwargs=dict(target_accept=0.95)) got rid of the divergences.

Best Answer

You might have better luck on our discourse: https://discourse.pymc.io/

A couple of notes: you need to use pm.sample(..., nuts_kwargs=dict(target_accept=0.95)) for target_accept to actually be applied (the next release changes this behavior so that passing it directly, as you tried, works). That might already solve your problem.

If not, these problems don't look too terrible, so you're probably fine. If you still want to investigate further, I would look at where exactly the divergences appear. ArviZ has some good plots for this; see for example https://arviz-devs.github.io/arviz/examples/plot_parallel.html or https://arviz-devs.github.io/arviz/examples/plot_pair.html. Basically, you want to find the dimensions where the divergences cluster; that tells you where the problem is.
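To make "find where the divergences cluster" concrete, here is a rough numpy sketch (a hypothetical helper, not part of pymc3's or ArviZ's API): given posterior draws and a boolean divergence mask, compare the divergent and non-divergent draws per dimension and flag the dimension with the largest gap:

```python
import numpy as np

def divergence_shift(samples, diverging):
    """Per-dimension gap between divergent and non-divergent draws,
    in units of the overall posterior standard deviation.

    samples: array of shape (n_draws, n_dims)
    diverging: boolean array of shape (n_draws,)
    """
    div = samples[diverging]
    ok = samples[~diverging]
    return np.abs(div.mean(axis=0) - ok.mean(axis=0)) / samples.std(axis=0)

# toy example: pretend divergences concentrate at small values of dimension 0
rng = np.random.default_rng(1)
samples = rng.normal(size=(1000, 3))
diverging = samples[:, 0] < -1.5  # stand-in for the sampler's divergence flags
shift = divergence_shift(samples, diverging)
print(shift.argmax())  # dimension 0 stands out
```

With a real trace you would build `samples` from the posterior draws and `diverging` from the sampler statistics; a large shift in, say, `sigma_sub` together with the matching `a_sub` entries points at the funnel that a non-centered parameterization fixes.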