Solved – Pymc3 – Sampling from a categorical distribution

categorical data, pymc

I've been experimenting with PyMC3 – I've used it for building regression models before, but I want to better understand how to deal with categorical data.

However, I think I'm misunderstanding how the Categorical distribution is meant to be used in PyMC. In order to test out using the distribution, I'm using the Categorical distribution to simulate a biased coin. When I run the following code:

```
import numpy as np
import pymc3

with pymc3.Model() as model:
    category = pymc3.Categorical(name='category',
                                 p=np.array([0.25]))
    trace = pymc3.sample(20, step=pymc3.Metropolis())
print(trace['category'])
```

I expect the trace to consist of numbers from the set {0, 1}, where the values are sampled from a Bernoulli distribution with p = 0.25.

However, the code above prints the following:
[ 0 -1 -2 -2 -2 -3 -4 -4 -4 -5 -5 -6 -7 -7 -6 -8 -8 -7 -6 -6]

It seems like I am misunderstanding something, as these numbers are not even in the support of the distribution that I am attempting to simulate.

Am I mistaken about the format that p takes? Am I accessing the results incorrectly? Help me understand what's going on here. Thanks in advance for the help!

Best Answer

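Use the BinaryMetropolis step method with p=np.array([0.25, 0.75]) and it should work. Note that p needs one entry per category and that entry i is the probability of category i, so this p gives category 1 a probability of 0.75; use p=np.array([0.75, 0.25]) if 1 should occur with probability 0.25, as in the question.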
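
As a rough illustration of why a flip-style proposal suits a two-valued variable, here is a minimal pure-Python sketch of a Metropolis sampler over {0, 1}. It is not PyMC3's BinaryMetropolis implementation; the function name `binary_metropolis` and the choice p = [0.75, 0.25] (so that 1 has probability 0.25) are assumptions made for this example.

```python
import random

def binary_metropolis(p, n_draws, seed=0):
    """Hypothetical helper: draw category indices 0/1 with probabilities p[0], p[1]."""
    rng = random.Random(seed)
    state = 0
    draws = []
    for _ in range(n_draws):
        proposal = 1 - state  # flipping keeps every proposal inside {0, 1}
        # The flip proposal is symmetric, so accept with min(1, p[proposal] / p[state]).
        if rng.random() < min(1.0, p[proposal] / p[state]):
            state = proposal
        draws.append(state)
    return draws

draws = binary_metropolis([0.75, 0.25], n_draws=20000)
freq_one = sum(draws) / len(draws)
print(sorted(set(draws)), freq_one)
```

Because the proposal can only flip between the two categories, the chain never leaves the support, unlike the drifting integers in the trace from the question.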

Related Solutions

Solved – Random effect on the intercept in models with categorical predictor

Brief: You can't say the random intercepts are the same in both models. They aren't estimated. If they were they would be intercepts rather than random intercepts. On the other hand the variance may be the same, since this is estimated.

Longer: I think I need to know LaTeX and have more time to give a reasonable answer. But hopefully someone can build on this.

The univariate random effects model with a single random intercept is the following, where $b_j$ is the random intercept and $\epsilon$ is the error term:

$$ y_{ij}=\beta_{0}+\beta_{1}x1_{ij}+b_{j}+\epsilon_{ij}. $$

If $x1$ is a dummy variable and $x1=0$, the intercept is: $$ \beta_{0} + b_j $$

If $x1=1$, the intercept is: $$ \beta_{0} + \beta_{1} + b_j $$ Note that changing the coding has no effect on $b_j$, the random intercept.
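
To make the point about coding concrete, suppose the dummy is reversed as $x1' = 1 - x1$ (a hypothetical recoding, introduced here only for illustration). Substituting $x1 = 1 - x1'$ into the model gives:

$$ y_{ij} = \beta_{0} + \beta_{1}(1 - x1'_{ij}) + b_{j} + \epsilon_{ij} = (\beta_{0} + \beta_{1}) - \beta_{1}x1'_{ij} + b_{j} + \epsilon_{ij}. $$

The fixed intercept and the sign of the slope change, but $b_j$ enters exactly as before.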

The random intercept can be considered a latent variable that is not estimated along with the fixed parameters $\beta_0, \beta_1, \dots$, but whose variance is estimated together with the variance of the error term.

Practically speaking, this is why it is such a pain to get predictions for individuals out of mixed-effects models. What is a BLUP or EBLUP?

Should I understand that including a random intercept will allow me to control for variability between groups in ONLY the reference category of the predictor, and that I need to include random 'slopes' to control for variability between groups in ALL categories of the predictor?

Random intercept should be sufficient I think.

Random intercept model: the overall level of the response is allowed to vary between clusters after controlling for covariates (it does not only apply to the reference group).
Random slope/coefficient model: often used with longitudinal data. In addition to the above, this allows the effects of the covariates to vary between clusters.
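
The difference between the two models can be sketched with noise-free cluster means. This is only a toy illustration: the values of `beta0` and `beta1`, the number of clusters, and the unit-variance random effects are all assumed for the example.

```python
import random

rng = random.Random(1)
beta0, beta1 = 2.0, 0.5   # illustrative fixed effects (assumed values)
n_clusters = 5

# Random-intercept model: one shift b_j per cluster, shared by both categories of x1.
b = [rng.gauss(0, 1) for _ in range(n_clusters)]
ri_means = [(beta0 + bj, beta0 + beta1 + bj) for bj in b]

# Random-slope model: the effect of x1 also gets a cluster-specific deviation u_j.
u = [rng.gauss(0, 1) for _ in range(n_clusters)]
rs_means = [(beta0 + bj, beta0 + (beta1 + uj) + bj) for bj, uj in zip(b, u)]

ri_gaps = [round(hi - lo, 10) for lo, hi in ri_means]
rs_gaps = [round(hi - lo, 10) for lo, hi in rs_means]
print(ri_gaps)  # the gap between categories is beta1 in every cluster
print(rs_gaps)  # the gap differs from cluster to cluster
```

With a random intercept alone, both categories move up or down together within a cluster, so the between-cluster variability is accounted for in every category, not just the reference one.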

Solved – 2-Gaussian mixture model inference with MCMC and PyMC

The problem is caused by the way that PyMC draws samples for this model. As explained in section 5.8.1 of the PyMC documentation, all elements of an array variable are updated together. For small arrays like center this is not a problem, but for a large array like category it leads to a low acceptance rate. You can see the acceptance rate via

```
print(mcmc.step_method_dict[category][0].ratio)
```
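
To see why a joint update of a large array tends to be rejected, here is a toy pure-Python sketch that does not use PyMC. It assumes 50 independent binary labels, illustrative target weights `w`, and proposals drawn uniformly from {0, 1}; the function name `acceptance_rates` is hypothetical.

```python
import random

def acceptance_rates(n_vars=50, n_iter=2000, seed=0):
    rng = random.Random(seed)
    w = [0.1, 0.9]  # unnormalised target weight of label 0 and label 1 (assumed)

    # (a) One Metropolis step proposes new labels for the whole array at once.
    z = [0] * n_vars
    joint_accepts = 0
    for _ in range(n_iter):
        prop = [rng.randint(0, 1) for _ in range(n_vars)]
        ratio = 1.0
        for old, new in zip(z, prop):
            ratio *= w[new] / w[old]
        if rng.random() < min(1.0, ratio):
            z = prop
            joint_accepts += 1

    # (b) Each label gets its own Metropolis step.
    z = [0] * n_vars
    single_accepts = 0
    for _ in range(n_iter):
        for i in range(n_vars):
            new = rng.randint(0, 1)
            if rng.random() < min(1.0, w[new] / w[z[i]]):
                z[i] = new
                single_accepts += 1

    return joint_accepts / n_iter, single_accepts / (n_iter * n_vars)

joint_rate, single_rate = acceptance_rates()
print(joint_rate, single_rate)
```

In the joint scheme a single badly placed label multiplies into the acceptance ratio for the whole array, so almost every proposal is rejected once the chain has reached a reasonable state; the per-label scheme accepts or rejects each label on its own.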

The solution suggested in the documentation is to use an array of scalar-valued variables. In addition, you need to configure some of the proposal distributions since the default choices are bad. Here is the code that works for me:

```
import numpy as np
import pymc as pm

# samples, nsamples, mu1_true and sig1_true are taken from the question's setup
sigmas = pm.Normal('sigmas', mu=0.1, tau=1000, size=2)
centers = pm.Normal('centers', [0.3, 0.7], [1/(0.1)**2, 1/(0.1)**2], size=2)
alpha = pm.Beta('alpha', alpha=2, beta=3)
# one scalar Categorical per data point instead of a single array variable
category = pm.Container([pm.Categorical("category%i" % i, [alpha, 1 - alpha])
                         for i in range(nsamples)])
observations = pm.Container([pm.Normal('samples_model%i' % i,
                   mu=centers[category[i]], tau=1/(sigmas[category[i]]**2),
                   value=samples[i], observed=True) for i in range(nsamples)])
model = pm.Model([observations, category, alpha, sigmas, centers])
mcmc = pm.MCMC(model)
# initialize in a good place to reduce the number of steps required
centers.value = [mu1_true, mu2_true]
# set a custom proposal for centers, since the default is bad
mcmc.use_step_method(pm.Metropolis, centers, proposal_sd=sig1_true/np.sqrt(nsamples))
# set a custom proposal for category, since the default is bad
for i in range(nsamples):
    mcmc.use_step_method(pm.DiscreteMetropolis, category[i], proposal_distribution='Prior')
mcmc.sample(100)  # beware sampling takes much longer now
# check the acceptance rates
print(mcmc.step_method_dict[category[0]][0].ratio)
print(mcmc.step_method_dict[centers][0].ratio)
print(mcmc.step_method_dict[alpha][0].ratio)
```

The proposal_sd and proposal_distribution options are explained in section 5.7.1 of the PyMC documentation. For the centers, I set the proposal to roughly match the standard deviation of the posterior, which is much smaller than the default because of the amount of data. PyMC does attempt to tune the width of the proposal, but this only works if the acceptance rate is sufficiently high to begin with. For category, the default is proposal_distribution='Poisson', which gives poor results (I don't know why this is, but it certainly doesn't sound like a sensible proposal for a binary variable).
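
The effect of matching the proposal width to the posterior width can be sketched without PyMC. The example below runs a random-walk Metropolis sampler on a zero-mean normal target whose standard deviation is sigma / sqrt(n); the values sigma = 0.1 and n = 1000, the wide proposal sd of 1.0 (an arbitrary value chosen for contrast, not PyMC's default), and the function name `rw_acceptance_rate` are all assumptions for illustration.

```python
import math
import random

def rw_acceptance_rate(target_sd, proposal_sd, n_iter=20000, seed=0):
    """Hypothetical helper: acceptance rate of random-walk Metropolis on N(0, target_sd^2)."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, proposal_sd)
        # log of the target density ratio for a zero-mean normal
        log_ratio = (x * x - prop * prop) / (2.0 * target_sd ** 2)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = prop
            accepted += 1
    return accepted / n_iter

sigma, n = 0.1, 1000                 # assumed noise level and sample size
posterior_sd = sigma / math.sqrt(n)  # rough width of the posterior of a mean

matched_rate = rw_acceptance_rate(posterior_sd, proposal_sd=posterior_sd)
wide_rate = rw_acceptance_rate(posterior_sd, proposal_sd=1.0)
print(matched_rate, wide_rate)
```

A proposal far wider than the posterior almost always lands where the density is negligible, which leaves automatic tuning with very few accepted moves to work from.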
