Solved – Markov Chain Monte Carlo (MCMC) with transformed data

bayesian, conditional probability, data transformation, gibbs, markov-chain-montecarlo

I want to obtain an estimate of a parameter $\Theta$ in a model for a random variable $X$ dependent on $\Theta$ with known but complicated likelihood $L(\Theta|X) = p(X|\Theta)$. $X$ is not directly observable, but
I can observe a transformation (also an r.v.) $Y$, which is a function $g$ of $X$ and $\Theta$: $Y = g(X; \Theta)$.
The function $g$ is nonlinear but 1-to-1 in $X$, and nonlinear (possibly not 1-to-1) also in $\Theta$.
I aim to estimate $\Theta$ with MCMC (a Gibbs/MH sampler); I therefore introduce another latent variable $R$ for which the
densities $p(R|\Theta, X)$ and $p(\Theta|R,X)$ have a simple form (not exactly Gaussian, but close), and so does the complete-data likelihood $L(\Theta | X, R) = p(X|\Theta, R)$.

To focus my question: in the usual setup, does a Jacobian determinant have to be included in the conditional densities, and if so, how should it be set up? And if not, what could be the issue with the implementation below? So far I have worked without a Jacobian, on the grounds that $g$ is 1-to-1 in $X$.

Here is the pseudo-code which was designed to produce a parameter estimate $\Theta^*$:

  1. choose initial $\Theta^{(0)}$; m=0
  2. while m $\leq$ NMaxIterations do {
    (a) re-transform $X^{(m)} = g^{-1}(y; \Theta^{(m)})$
    (b) sample
    $R^{(m+1)} \sim p(R | \Theta^{(m)}, X^{(m)})$ , and then
    $\Theta^{(m+1)} \sim p(\Theta | R^{(m+1)}, X^{(m)})$
    (c) m=m+1
    }

  3. $\Theta^* = \text{mean}(\Theta^{(N_B)}, \ldots, \Theta^{(NMaxIterations)})$, where $N_B$ is some burn-in period.
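
In R, a skeleton of this scheme might look roughly as follows; g_inv, sample_R, and sample_theta are hypothetical stand-ins for the problem-specific inverse transform and conditional samplers.

#hypothetical skeleton of the sampler described above
#g_inv(y, theta): inverse transform x = g^{-1}(y; theta)
#sample_R(theta, x): one draw from p(R | theta, x)
#sample_theta(R, x): one draw from p(theta | R, x)
run_sampler=function(y, theta0, n_iter, n_burn, g_inv, sample_R, sample_theta){
  theta=numeric(n_iter+1)
  theta[1]=theta0
  for (m in 1:n_iter){
    x=g_inv(y, theta[m])              #step 2(a): re-transform
    R=sample_R(theta[m], x)           #step 2(b): draw R
    theta[m+1]=sample_theta(R, x)     #step 2(b): draw theta
  }
  mean(theta[(n_burn+1):(n_iter+1)])  #step 3: average after burn-in
}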

Simulation studies (I simulate a realization of $X$ and transform it via $g$ to produce data $Y$ before running the above algorithm) show that if $g$ is the identity, $\Theta^{(m)}$ converges to the true parameter pretty quickly, as hoped for.
However, even for a function $g$ that depends only linearly on $\Theta$, such as $g(x; \Theta) = \Theta x$, the algorithm diverges quickly (Inf or NaN in $\Theta^{(m)}$ after only a few iterations).

My argument for step (2a) so far has been that, to write the problem in a classic MCMC setup, one formally also has to sample the latent variable $X$; that is, erase step (2a) above and insert the following as a first draw before sampling $R^{(m+1)}$ in step (2b):
sample
\begin{eqnarray}
X^{(m)} & \sim & p(X | \Theta^{(m)}, y, R^{(m)}) = \delta\big(X-g^{-1}(y; \Theta^{(m)})\big)
\end{eqnarray}
(where, I suppose, the variable $R^{(m)}$ in the conditioning can be omitted, since $X$ does not depend on it). But this then amounts to nothing more than re-transforming $y$ into $X^{(m)}$ using the current $\Theta^{(m)}$.

I am grateful for any insights. Thank you very much!

Best Answer

While Glen_b's answer is mathematically correct, it may be slightly off the mark with respect to the very unusual setting of the question here. In short, I think that the Jacobian issue may be irrelevant here from a simulation perspective.

First, if you observe $Y$ and if the Jacobian of the transform $Y = g(X; \theta)$ does not depend on $\theta$, the Jacobian vanishes from the full conditionals in the Gibbs sampler and from the Metropolis-Hastings formula. If the Jacobian does depend on $\theta$ as well, then it has to be included in the conditionals and in the Metropolis-Hastings formula.
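
To spell this out (for scalar $X$ and $Y$; the multivariate case replaces the derivative by a Jacobian determinant), the change-of-variables formula gives $$p(y|\theta)=p_X\big(g^{-1}(y;\theta)\,\big|\,\theta\big)\,\left|\frac{\partial}{\partial y}g^{-1}(y;\theta)\right|,$$ so a Jacobian factor that varies with $\theta$ must be carried into every conditional and Metropolis-Hastings ratio involving $\theta$, whereas a factor depending on $y$ alone cancels.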

More central to the point raised by the question, I find the question utterly interesting, and my first solution would have been yours, namely to use the Dirac mass transform of $(\theta,y)$ into $x$. However, it does not work, as shown by the following toy example. One starts with a normal observation $$x\sim\mathcal{N}(0,\theta^{-2})$$ and an exponential prior on $\theta^2$, $$\theta^2\sim\mathcal{E}(1).$$ It is a standard derivation to show that the posterior distribution of $\theta^2$ [or conditional of $\theta^2$ given $x$, if you prefer] is a Gamma distribution, $$\theta^2|x\sim\text{Ga}(3/2,\,1+x^2/2).$$ Now, if one considers the transform $y=\theta x$, then $y|\theta\sim\mathcal{N}(0,1)$, which means that the distribution of $y$ does not depend on $\theta$, hence that $\theta$ and $y$ are independent. In summary, my toy example involves the distributions $$\theta^2\sim\mathcal{E}(1),\quad x\sim\mathcal{N}(0,\theta^{-2}),\quad y=\theta x\sim\mathcal{N}(0,1),$$ which means that the posterior on $\theta$ given $y$ is the prior.
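
For completeness, the derivation is a one-liner: writing $u=\theta^2$, $$p(u|x)\;\propto\;p(x|u)\,p(u)\;\propto\;u^{1/2}\,e^{-ux^2/2}\,e^{-u}\;=\;u^{3/2-1}\,e^{-u(1+x^2/2)},$$ which is the kernel of the $\text{Ga}(3/2,\,1+x^2/2)$ density quoted above.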

The R code following your suggested implementation is

y=rnorm(1) #observation
T=1e4 #number of MCMC iterations
the=rep(1,T) #vector of theta^2's
for (t in 2:T){ #MCMC iterations
  #step 2(a)
  #true conditional of x on θ:
  x=y/sqrt(the[t-1]) 
  #step 2(b) with no R
  #true conditional of θ on x:
  the[t]=rgamma(1,shape=1.5,rate=1+.5*x^2)}

leads to a complete lack of fit. Actually, the lack of fit can be amplified by choosing a very large value for $y$.

[Figure: histogram of the Gibbs output against the intended target density]
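
A minimal sketch to reproduce such a comparison, assuming the toy model above (where the correct posterior of $\theta^2$ given $y$ is simply the $\mathcal{E}(1)$ prior, by the independence of $\theta$ and $y$), is

#compare the Gibbs draws with the correct target p(theta^2|y)=Exp(1)
hist(the[-(1:100)],prob=TRUE,nclass=50,
  xlab="theta^2",main="Gibbs output vs. intended target")
curve(dexp(x,rate=1),add=TRUE,lwd=2)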

The intuitive (and theoretically valid) explanation for this lack of fit is that the Gibbs sampler (or, equivalently, any other MCMC sampler) should always condition on $y$, the sole observation. Computing $x$ as a deterministic transform of $(y,\theta)$ therefore has no impact: since $x$ is a Dirac mass given $(y,\theta)$, the distribution of $\theta$ given $x$ and $y$ is the same as the distribution of $\theta$ given $y$ alone.

In conclusion, the correct version of your algorithm is as follows:

  1. choose initial $\Theta^{(0)}$; m=0
  2. while m $\leq$ NMaxIterations do {
    (a) transform $X^{(m)} = g^{-1}(y; \Theta^{(m)})$
    (b) sample
    $R^{(m+1)} \sim p(R | \Theta^{(m)}, y)$ , and then
    $\Theta^{(m+1)} \sim p(\Theta | R^{(m+1)}, y)$
    (c) m=m+1 }
  3. $\Theta^* = \text{mean}(\Theta^{(N_B)}, \ldots, \Theta^{(NMaxIterations)})$, where $N_B$ is some burn-in period.

This clearly means that the completion step 2(a) is unnecessary: $X^{(m)}$ no longer appears in the conditional draws.
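
In the toy example above, the corrected scheme reduces to sampling $\theta^2$ from its conditional given $y$ alone, which, by the independence of $\theta$ and $y$, is just the $\mathcal{E}(1)$ prior; a minimal sketch (reusing T from the code above) is

#corrected sampler for the toy example: condition on y only;
#here p(theta^2|y) is the Exp(1) prior, so the draws are direct
the2=rexp(T)  #draws from the correct posterior of theta^2 given y
mean(the2)    #step 3: posterior-mean estimate of theta^2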
