Solved – MCMC convergence

bayesian, markov-chain-montecarlo

I'm trying to fit a Bayesian model with many parameters (about 150).

I run MCMC (10,000 iterations) with thinning to avoid autocorrelation problems and in the hope that it would save me some computation time. This gives me the sample I use to compute Monte Carlo estimates. I know that proper inference requires checking MCMC convergence, and as a first attempt I did this with a graphical assessment using trace plots (I know there are more formal approaches). I realised that a few parameters show no sign of convergence at all. What does this mean?

When I check the effective sample size, I get very low values for these parameters (around 6), which is a sign of highly correlated samples.
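
For reference, this is roughly the kind of calculation behind the effective sample size I'm looking at: the number of draws deflated by the chain's autocorrelation. A sketch in plain numpy, not the exact estimator my software uses; draws is assumed to hold the stored samples of one parameter:

import numpy as np

def effective_sample_size(draws):
    # rough ESS: N / (1 + 2 * sum of the leading positive autocorrelations)
    draws = np.asarray(draws, dtype=float)
    n = len(draws)
    centered = draws - draws.mean()
    acov = np.correlate(centered, centered, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    # truncate at the first negative autocorrelation (crude cutoff)
    cut = int(np.argmax(rho < 0)) if np.any(rho < 0) else len(rho)
    return n / (1.0 + 2.0 * rho[1:cut].sum())

# draws = one parameter's stored samples, e.g. draws = chain[:, j]
# print(effective_sample_size(draws))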

I don't know how to proceed from here.

UPDATE: This is my model in BUGS. It's a Dirichlet regression with a response variable y made up of 4 categories (so K = 4). I'm applying it to a dataset on violent crime in US states, so my idea was to use a coefficient vector beta1[i, ] for each observation i (the states).
Example of data:

     state     violent_crime  murder  rape  robbery  aggravated_assault  median_income
1    Alabama           20727     276  2005     4701               13745          42917
2    Alaska             4684      41   771      629                3243          70898

Model:

model
{
    for (i in 1:n) {
        # state-specific coefficient vector (random effects), MVN prior
        beta1[i, 1:K] ~ dmnorm(zero[], t[1:K, 1:K])

        # Dirichlet likelihood for the K = 4 crime-category proportions
        Y[i, 1:K] ~ ddirich(alpha[i, 1:K])

        # linear predictor on the exponential scale (K = 4)
        for (k in 1:4) {
            xb[i, k] <- exp(beta0[k] + beta1[i, k] * x[i])
        }

        # expected proportions; the last category is handled separately
        for (k in 1:3) {
            mu[i, k] <- xb[i, k] / sum(xb[i, 1:4])
        }
        mu[i, K] <- 1 / sum(xb[i, 1:4])

        # Dirichlet parameters: mean proportions scaled by precision phi
        for (k in 1:K) {
            alpha[i, k] <- mu[i, k] * phi
        }
    }

    # diagonal precision matrix for the vague prior on beta0
    for (i in 1:K) {
        for (j in 1:K) {
            B[i, j] <- 0.00001 * equals(i, j)
        }
    }

    # prior
    beta0[1:K] ~ dmnorm(zero[], B[, ])

    for (k in 1:K) {
        zero[k] <- 0
    }

    # hyper-priors
    t[1:K, 1:K] ~ dwish(R[1:K, 1:K], 4)
    phi ~ dgamma(0.0001, 0.0001)
}

Best Answer

It means that your chain most likely did not converge. By this I mean you should be wary of the entire chain, not just of the dimensions with a low effective sample size.

Solutions

Burn-in (discarding the early part of the chain) - see also this question

Sometimes a low effective number of samples arises simply because the chain started in a low-probability region and only found the basin of convergence (the high-probability region, or typical set) later on. I do not recommend that you just play with the burn-in for now, since with only one chain it's hard to tell what's going on.

Instead, you might want to run a few chains in parallel, as opposed to a single long chain (see MacKay's book, Chapter 29). With multiple chains it is usually easier to spot lack of convergence, although there are different schools of thought here on the number of chains (I usually do 3 or 4).
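
As a rough illustration of what spotting lack of convergence looks like numerically, here is a minimal sketch of the Gelman-Rubin potential scale reduction factor (R-hat) computed from parallel chains for one parameter; values well above 1 say the chains disagree. It assumes chains is an m-by-n numpy array of draws, one row per chain:

import numpy as np

def gelman_rubin(chains):
    # chains: shape (m, n) -- m parallel chains, n draws each, one parameter
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()        # W: within-chain variance
    between = n * chain_means.var(ddof=1)             # B: between-chain variance
    var_hat = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(var_hat / within)                  # R-hat, close to 1 when chains mix

# chains = np.stack([chain1[:, j], chain2[:, j], chain3[:, j]])
# print(gelman_rubin(chains))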

Regarding burn-in, there are also several opinions. Some people, such as Geyer, say that it's pointless, as long as you are sure that you start in a high probability region:

Any point you don't mind having in a sample is a good starting point.

However, this is easier said than done in 150 dimensions. State-of-the-art statistical packages discard as much as 50% of the chain as burn-in (see Stan, where it is called warmup), but one reason for such a long burn-in is that it is also used to adapt some of the sampler's parameters.

In your case, I definitely recommend that you initialize your sampler from a good guess (e.g., somewhere around the mode is better than nothing, although the mode itself is likely not in the typical set), and do some burn-in since in high dimensions it's hard to know where the probability mass resides (not the same as the probability density). In my opinion, it's better to burn-in more than less (at worst, you are simply throwing away some effective samples, whereas if you do not burn-in you might be keeping samples that are not representative of the target distribution).
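
One crude way to check whether your burn-in was generous enough is to compare the first and second halves of the retained draws; if they still disagree noticeably, burn in more. A sketch with an arbitrary tolerance, not a formal test such as Geweke's:

import numpy as np

def halves_disagree(draws, tol=0.1):
    # crude check: do the first and second halves of the retained chain
    # give noticeably different means relative to the overall spread?
    draws = np.asarray(draws, dtype=float)
    h = len(draws) // 2
    diff = abs(draws[:h].mean() - draws[h:].mean())
    return diff > tol * draws.std(ddof=1)

# e.g. flag parameters whose halves still disagree after burn-in:
# flagged = [j for j in range(chain.shape[1]) if halves_disagree(chain[:, j])]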

Sample more

Simple enough: start where your MCMC chain ended (or from some other good guess, as mentioned above), discard the previous chain, and take more samples overall.
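
A toy sketch of that restart pattern; the sampler here is only a stand-in random-walk Metropolis on a dummy target, to show the mechanics of continuing from the last state:

import numpy as np

def run_chain(init, n_iter, step=0.5):
    # stand-in for your real sampler: random-walk Metropolis on a dummy
    # standard-normal target, just to show the restart mechanics
    rng = np.random.default_rng()
    logp = lambda z: -0.5 * np.sum(z ** 2)
    x = np.array(init, dtype=float)
    out = np.empty((n_iter, x.size))
    for i in range(n_iter):
        prop = x + step * rng.standard_normal(x.size)
        if np.log(rng.random()) < logp(prop) - logp(x):
            x = prop
        out[i] = x
    return out

first = run_chain(init=np.zeros(3), n_iter=10_000)
# continue from the final state, discard the earlier draws, sample more
longer = run_chain(init=first[-1], n_iter=50_000)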

Better tune your sampler

I don't know which MCMC method you are using, but most samplers have several tunable hyper-parameters (e.g., jump length(s) for Metropolis-Hastings with Gaussian proposal, mass matrix for Hamiltonian Monte Carlo, etc.). Performance of most MCMC samplers heavily depends on the choice of these parameters, so you might have a look at your sampled distribution and figure out whether you can improve them.
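
For example, with a random-walk Metropolis sampler a common heuristic is to adjust the proposal scale during a warm-up phase until the acceptance rate lands somewhere around 20-40%. A toy sketch of that idea (not Stan's adaptation scheme, and on a dummy target rather than your model):

import numpy as np

def tune_step(logp, x0, n_warmup=2_000, target_accept=0.3):
    # crude warm-up: scale the random-walk proposal up or down depending
    # on the running acceptance rate (a toy heuristic)
    rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    step, accepted = 0.1, 0
    for i in range(1, n_warmup + 1):
        prop = x + step * rng.standard_normal(x.size)
        if np.log(rng.random()) < logp(prop) - logp(x):
            x, accepted = prop, accepted + 1
        if i % 100 == 0:
            step *= 1.1 if accepted / i > target_accept else 0.9
    return step, x

logp = lambda z: -0.5 * np.sum(z ** 2)   # dummy target, stands in for the posterior
step, start = tune_step(logp, np.zeros(5))
print("tuned proposal scale:", step)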

Change sampler

Sometimes, the sampler you are using might be not the best choice. For example, if your problem deals with continuous variables with a smooth, differentiable target distribution, you probably want to use some variant of Hamiltonian Monte Carlo (state of the art is NUTS, as implemented in Stan).
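
To give a feel for how gradient-based samplers move, here is a bare-bones HMC transition with a fixed step size, fixed path length and identity mass matrix, sketched on a dummy target; NUTS automates the tuning of these choices:

import numpy as np

def hmc_step(logp, grad_logp, x, rng, eps=0.1, n_leapfrog=20):
    # one HMC transition with identity mass matrix and fixed path length
    p = rng.standard_normal(x.size)                    # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * eps * grad_logp(x_new)              # leapfrog: half momentum step
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new                           # full position step
        p_new += eps * grad_logp(x_new)                # full momentum step
    x_new += eps * p_new
    p_new += 0.5 * eps * grad_logp(x_new)              # final half momentum step
    h_old = -logp(x) + 0.5 * np.dot(p, p)              # Metropolis correction on
    h_new = -logp(x_new) + 0.5 * np.dot(p_new, p_new)  # the joint energy
    return x_new if np.log(rng.random()) < h_old - h_new else x

rng = np.random.default_rng()
logp = lambda z: -0.5 * np.sum(z ** 2)                 # dummy target
grad_logp = lambda z: -z
x, draws = np.zeros(5), []
for _ in range(1_000):
    x = hmc_step(logp, grad_logp, x, rng)
    draws.append(x)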

Change parameterization

As suggested by @Björn in the comments, you might want to try a different parameterization of your model. Simple reparameterizations amount to a linear or nonlinear transformation of your variables; the former, for example, to reduce correlation between variables, and the latter, for example, to get rid of long, fat tails in the posterior which may be hard to sample from. More complex reparameterizations include the addition of auxiliary variables that might help mixing (e.g., by bridging different regions of the posterior), but this becomes extremely model-dependent.
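
As a toy illustration of the linear case: whiten strongly correlated variables, sample in the transformed space, and map the draws back; the nonlinear case is analogous (e.g., sampling log(phi) rather than phi to tame a heavy right tail). A sketch on a dummy 2-D Gaussian, not your model:

import numpy as np

# dummy target: a 2-D Gaussian with correlation 0.99 (hard for a random walk)
Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])
L = np.linalg.cholesky(Sigma)
logp_z = lambda z: -0.5 * np.sum(z ** 2)   # after whitening, the target is N(0, I)

# random-walk Metropolis in the whitened space z, then map back via x = L z
rng = np.random.default_rng()
z, step, draws = np.zeros(2), 0.5, []
for _ in range(10_000):
    prop = z + step * rng.standard_normal(2)
    if np.log(rng.random()) < logp_z(prop) - logp_z(z):
        z = prop
    draws.append(L @ z)

draws = np.array(draws)
print(np.corrcoef(draws.T))   # the mapped draws recover the ~0.99 correlation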
