Suppose I want to sample from the joint distribution $p(X, Y)$, where $X$ is a random variable and $Y = f(X)$ where $f$ is a known function of $X$. However, sampling from $p(X,Y)$ directly is hard. Could I use Gibbs sampling and sample from the conditionals $x^{(k)} \sim p(X\mid Y=y^{(k-1)})$ with $y^{(k-1)}=f(x^{(k-1)})$ and $y^{(k)} \sim p(Y\mid X=x^{(k)}) = \delta(y-f(x^{(k)}))$ for $k = 1, 2, \dots$? Would this Gibbs sampler converge?
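For concreteness, here is a sketch of the scheme I have in mind, where `sample_x_given_y` is a hypothetical routine for drawing from $p(X \mid Y = y)$:

```python
def proposed_gibbs(f, sample_x_given_y, x0, n_iter=1000):
    """Alternate the two conditional draws described above.

    f                -- the known deterministic map, y = f(x)
    sample_x_given_y -- hypothetical sampler for p(X | Y = y)
    """
    x = x0
    samples = []
    for _ in range(n_iter):
        y = f(x)                  # p(Y | X = x) is the point mass delta(y - f(x))
        x = sample_x_given_y(y)   # x^(k) ~ p(X | Y = y^(k-1))
        samples.append((x, y))
    return samples
```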
Solved – Sampling from the joint distribution p(x,y) when y = f(x)
Tags: gibbs, joint distribution, sampling
Related Solutions
Those distributions you call "marginal" are not marginal. They are conditional distributions because you wrote $x \mid y$. The marginal distribution of $X$, for example, is necessarily independent of the value of $Y$.
To see how the conditional distribution is gamma, all you have to do is write $$f_{X \mid Y}(x) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \propto f_{X,Y}(x,y).$$ That is to say, the conditional density is proportional to the joint density, with the denominator $f_Y(y)$ acting only as a normalizing constant. So we have $$f_{X \mid Y}(x) \propto x^2 e^{-x(y^2+4)},$$ ignoring any factors that are not functions of $x$. Then we recognize that the gamma distribution has density $$f_S(s) \propto s^{a-1} e^{-bs},$$ so matching exponents with shape $a = 3$ and rate $b = y^2+4$ shows that $X \mid Y \sim \operatorname{Gamma}(a = 3, b = y^2+4)$.
The conditional distribution of $Y \mid X$ is done in a similar fashion. Just ignore constants of proportionality: $$f_{Y \mid X}(y) \propto e^{-(x+1)y^2+2y},$$ but this one requires us to complete the square to get it to look like a normal density: $$-(x+1)y^2 + 2y = -(x+1)\left(y - \tfrac{1}{x+1}\right)^2 + \tfrac{1}{x+1},$$ and after exponentiating and removing the $e^{1/(x+1)}$ factor, comparing this against $$f_W(w) \propto e^{-(w-\mu)^2/(2\sigma^2)},$$ we see that we have a normal density with mean $\mu = 1/(x+1)$ and variance $\sigma^2 = 1/(2(x+1))$.
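Putting the two conditionals together, here is a minimal sketch of the resulting Gibbs sampler in Python/NumPy. Note that NumPy's gamma sampler is parameterized by shape and scale, so the rate $b = y^2+4$ enters as `scale = 1/(y**2 + 4)`:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs(n_iter=10_000, y=0.0):
    """Alternate the two full conditionals derived above."""
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        # X | Y = y  ~  Gamma(shape = 3, rate = y^2 + 4); NumPy uses scale = 1/rate
        x = rng.gamma(shape=3.0, scale=1.0 / (y**2 + 4))
        # Y | X = x  ~  Normal(mean = 1/(x+1), variance = 1/(2(x+1)))
        y = rng.normal(loc=1.0 / (x + 1), scale=np.sqrt(1.0 / (2 * (x + 1))))
        samples[t] = (x, y)
    return samples
```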
Now, if you wanted the marginal distributions, you would need to integrate: $$f_X(x) = \int_{y=-\infty}^\infty f_{X,Y}(x,y) \, dy,$$ for example. And as you can see, this expression will not be a function of $Y$. The difference is that if I simulated realizations of ordered pairs $(X_i, Y_i)$ from the joint distribution, the marginal density for $X$ would be what you would see if I only told you the values of $X_i$. The conditional distribution of $X$ given $Y = y$ would be what you would see if I only gave you the $X_i$ for which the corresponding $Y_i$ was equal to $y$.
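Continuing with the sketch above, the distinction can be seen empirically: keep all the $X_i$ for the marginal, but only those $X_i$ whose $Y_i$ lands in a small window around $y$ for the conditional (the window width `eps` is an arbitrary illustration choice, and burn-in is ignored for simplicity):

```python
pairs = gibbs(n_iter=200_000)

x_marginal = pairs[:, 0]               # all X_i: draws from the marginal of X

y0, eps = 0.5, 0.02                    # condition on Y close to y0
keep = np.abs(pairs[:, 1] - y0) < eps
x_conditional = pairs[keep, 0]         # X_i with Y_i near y0: approx. X | Y = y0

# The conditional mean should be close to 3 / (y0**2 + 4), the mean of
# Gamma(3, y0^2 + 4); the marginal mean will generally differ.
print(x_marginal.mean(), x_conditional.mean(), 3 / (y0**2 + 4))
```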
In an ideal world, sampling $x_1$ from $p_1(x_1)$ and then $x_2$ from $p_{2|1}(x_2|x_1)$ is a correct way to simulate from the joint. In case one of these distributions is unavailable, simulating a single step of Metropolis-Within-Gibbs targeting $p_{1|2}(\cdot|x_2^{(t-1)})$ and a single step of Metropolis-Within-Gibbs targeting $p_{2|1}(\cdot|x_1^{(t)})$ is correct. Note that since $p_1(\cdot)$ is available, the MCMC chain can be started at stationarity by drawing $x_1^{(0)} \sim p_1(\cdot)$. Note also that, if $p_1(\cdot)$ is available and $p_{2|1}(\cdot|x_1)$ is available, then (a) $p_{1|2}(\cdot|x_2^{(t)})$ is available up to a constant, since $p_{1|2}(x_1|x_2) \propto p_1(x_1)\, p_{2|1}(x_2|x_1)$, and (b) $p_1(\cdot)$ can be used as a proposal in the Metropolis-Within-Gibbs step.
In the event where the density $p_1(\cdot)$ cannot be evaluated but simulations can be produced from $p_1(\cdot)$, and $p_{2|1}(\cdot|x_1)$ is available, one can make Metropolis proposals $x_1'$ from $p_1(\cdot)$ and accept them with probability $\min(1, r)$, where $$r = \dfrac{p_{1|2}(x_1'|x_2^{(t)})}{p_{1|2}(x_1^{(t-1)}|x_2^{(t)})} \dfrac{p_{1}(x_1^{(t-1)})}{p_{1}(x_1')}=\dfrac{p_{2|1}(x_2^{(t)}|x_1')}{p_{2|1}(x_2^{(t)}|x_1^{(t-1)})},$$ in which the intractable $p_1$ terms cancel.
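A sketch of that last Metropolis step, with `sample_p1` a routine that simulates from $p_1(\cdot)$ and `p21(x2, x1)` an evaluator of $p_{2|1}(x_2 \mid x_1)$ (both names are placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def mwg_update_x1(x1, x2, sample_p1, p21):
    """One Metropolis-within-Gibbs update of x1 given x2,
    using p1 as an independence proposal.

    Proposes x1' ~ p1 and accepts with probability
    min(1, p21(x2 | x1') / p21(x2 | x1)); the unknown p1 terms cancel.
    """
    x1_prop = sample_p1()
    if rng.uniform() < p21(x2, x1_prop) / p21(x2, x1):
        return x1_prop
    return x1
```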
Best Answer
(This answer does not really help with the Gibbs sampler much, but points at something else you could do. I do think the Gibbs sampler would converge.)
Recall that $$p(X,Y) = p(Y \mid X)\, p(X)\, . $$
Now if you can sample from the marginal of $X$, then $X$ is the linchpin variable: since $p(Y \mid X = x)$ is the point mass $\delta(y - f(x))$, every draw $x$ from $p(X)$ immediately yields a draw $(x, f(x))$ from the joint. You can use exact sampling methods if $p(X)$ is a known distribution, or you could use MCMC. This sampler should intuitively converge faster than the Gibbs sampler because the Markov chain lives only on $X$ here, whereas the Gibbs sampler runs a Markov chain on the pair $(X, Y)$.
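In code, the linchpin construction is just composition sampling; a minimal sketch, where `sample_x` stands in for whatever exact or MCMC sampler of $p(X)$ is available:

```python
def linchpin_sampler(sample_x, f, n_iter=10_000):
    samples = []
    for _ in range(n_iter):
        x = sample_x()             # exact draw (or one MCMC step) targeting p(X)
        samples.append((x, f(x)))  # p(Y | X = x) is a point mass at f(x)
    return samples
```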