The problem with your description of the Metropolis-Hastings algorithm is that your notation does not distinguish between the probability densities in the actual problem you are trying to solve, and the proposal density used in the algorithm. Your notation also fails to capture the fact that we are trying to simulate from the posterior distribution, but we only have a kernel of this distribution. A better description of the algorithm, which makes these distinctions in the notation, is as follows:
You start in a situation where you do not know the posterior $p(\theta|D)$, but you do know a kernel of this distribution $K(\theta|D) \propto p(\theta|D)$. You want to simulate values from the posterior. In the MH algorithm you start at an arbitrary parameter value $\theta_0$ and simulate using the following recursive scheme (which is a Markov chain):
- We generate a proposed value $\theta'_{t+1}$ from the proposal density $g(\theta'_{t+1}|\theta_{t})$.
- For the proposed value, we define the acceptance ratio:
$$A(\theta'_{t+1} | \theta_t) \equiv \frac{K(\theta'_{t+1}|D)}{K(\theta_{t}|D)} \cdot \frac{g(\theta_{t} | \theta'_{t+1})}{g(\theta'_{t+1} | \theta_t)}.$$
- With probability $\min (A(\theta'_{t+1} | \theta_t), 1)$ we accept the proposed value and set $\theta_{t+1} = \theta'_{t+1}$. Otherwise we reject the proposed value and set $\theta_{t+1} = \theta_{t}$.
It can be shown that this Markov chain has stationary distribution $p(\theta|D)$, which is the posterior distribution of interest. Note that this is true even though the algorithm only uses a kernel of the distribution. We can therefore rely on the limiting properties of Markov chains to simulate from this posterior distribution. Usually this involves discarding a small number of initial 'burn-in' iterations and then keeping a series of auto-correlated simulations from the limiting stationary distribution. We can also rely on ergodic theorems to estimate the true posterior moments of functions of the parameter from the corresponding sample moments of the Markov chain.
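The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the kernel, proposal, and target below are hypothetical stand-ins (here a Gaussian random walk targeting a known normal so the result is easy to check), and everything is done on the log scale for numerical stability.

```python
import numpy as np

def metropolis_hastings(log_kernel, propose, log_proposal_density,
                        theta0, n_iter, rng):
    """Generic MH sampler; log_kernel(theta) is log K(theta|D), known
    only up to an additive constant (the normalising constant cancels)."""
    samples = np.empty(n_iter)
    theta = theta0
    for t in range(n_iter):
        theta_prop = propose(theta, rng)
        # log acceptance ratio: kernel ratio times the reversed proposal ratio
        log_A = (log_kernel(theta_prop) - log_kernel(theta)
                 + log_proposal_density(theta, theta_prop)
                 - log_proposal_density(theta_prop, theta))
        # accept with probability min(A, 1)
        if np.log(rng.uniform()) < min(log_A, 0.0):
            theta = theta_prop
        samples[t] = theta
    return samples

rng = np.random.default_rng(0)
# Toy target: kernel of a N(2, 1) posterior, normalising constant dropped
log_kernel = lambda th: -0.5 * (th - 2.0) ** 2
propose = lambda th, rng: th + rng.normal(0.0, 1.0)
log_q = lambda to, frm: -0.5 * (to - frm) ** 2  # symmetric; cancels anyway

draws = metropolis_hastings(log_kernel, propose, log_q, 0.0, 20_000, rng)
burned = draws[2_000:]  # discard burn-in
print(burned.mean())  # close to the true posterior mean of 2
```

Note that the sampler never needs the normalising constant of the posterior: only kernel ratios appear.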
Special case - symmetric proposal distribution: In many applications of the MH algorithm it is common to use a proposal density that is symmetric, in the sense that:
$$g(\theta'|\theta) = g(\theta|\theta') \quad \text{for all } \theta, \theta'.$$
(Note that a sufficient condition for this is that the density value depends on the parameters only through the norm $||\theta' - \theta||$.) In this special case the acceptance ratio reduces to:
$$A(\theta'_{t+1} | \theta_t) \equiv \frac{K(\theta'_{t+1}|D)}{K(\theta_{t}|D)}.$$
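With a symmetric proposal, a single iteration therefore only needs the kernel ratio (this is the original Metropolis algorithm). A minimal sketch of one such step, with a hypothetical one-dimensional log-kernel:

```python
import numpy as np

def metropolis_step(theta, log_kernel, scale, rng):
    # Symmetric random-walk proposal: g(theta'|theta) = g(theta|theta'),
    # so the proposal densities cancel and only K(theta'|D)/K(theta|D) remains.
    prop = theta + rng.normal(0.0, scale)
    log_A = log_kernel(prop) - log_kernel(theta)
    return prop if np.log(rng.uniform()) < min(log_A, 0.0) else theta

# Toy usage: kernel of a standard normal target
rng = np.random.default_rng(1)
log_kernel = lambda th: -0.5 * th ** 2
theta, draws = 0.0, []
for _ in range(10_000):
    theta = metropolis_step(theta, log_kernel, 1.0, rng)
    draws.append(theta)
```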
Now that we have a clearer explanation of the actual workings of the algorithm, I will try to answer your specific questions. (For consistency, I will translate your questions into notation that is consistent with my explanation of the algorithm.) Your Question 4 is unclear to me (there is no cost function in the algorithm so I don't know what you're referring to here), but I will answer the other three questions.
Question 1) Are $g(\theta'_{t+1} | \theta_t)$ and $g(\theta_t | \theta'_{t+1})$ different? What if I am using a uniform proposal distribution?
In the case where you use a proposal distribution that is symmetric (in the sense described above) the two proposal densities (with the argument and conditioning parameter switched) will be the same. Symmetry occurs in the case where you use the uniform proposal density that is centred around the conditioning value:
$$g(\theta' | \theta) \propto \mathbb{I}(|\theta' - \theta| \leqslant \varepsilon).$$
In this case, switching the terms in the proposal density does not alter the value (i.e., they are not different). If you are using a uniform proposal density that is not centred around the conditioning value then this will not hold.
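A quick numeric check of both claims, using hypothetical helper functions for the centred and non-centred uniform densities:

```python
def g_uniform(theta_to, theta_from, h):
    # Centred uniform proposal: density 1/(2h) when |theta' - theta| <= h, else 0
    return 1.0 / (2 * h) if abs(theta_to - theta_from) <= h else 0.0

def g_uniform_shifted(theta_to, theta_from, h, shift):
    # Uniform on [theta + shift - h, theta + shift + h]: NOT centred on theta
    return 1.0 / (2 * h) if abs(theta_to - (theta_from + shift)) <= h else 0.0

# Centred: swapping argument and conditioning value gives the same density
print(g_uniform(1.3, 0.9, 0.5) == g_uniform(0.9, 1.3, 0.5))  # True

# Non-centred: symmetry fails (one direction lands outside the support)
print(g_uniform_shifted(1.3, 0.9, 0.5, 0.3)
      == g_uniform_shifted(0.9, 1.3, 0.5, 0.3))  # False
```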
Question 2) If I am using a uniform proposal distribution, could I write the acceptance ratio without the ratio of proposal densities?
Assuming your uniform proposal distribution is centred around the conditioning parameter (and thus symmetric in the above sense), yes you can.
Question 3) If I use a normal distribution, will I still have symmetry of the proposal distribution?
Symmetry occurs if you use a normal distribution with mean equal to the conditioning parameter and variance independent of this parameter:
$$g(\theta' | \theta) = \text{N}(\theta' |\theta, \Sigma) \propto \exp \Big( -\frac{1}{2} (\theta' - \theta)^\text{T} \Sigma^{-1} (\theta' - \theta) \Big).$$
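Assuming SciPy is available, this symmetry is easy to verify numerically: the density depends on $\theta'$ and $\theta$ only through the quadratic form in $\theta' - \theta$, which is unchanged when the two are swapped.

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])  # fixed covariance, independent of theta
theta = np.array([0.0, 1.0])
theta_prime = np.array([0.7, -0.2])

# g(theta'|theta): normal centred at the conditioning value
g_forward = multivariate_normal(mean=theta, cov=Sigma).pdf(theta_prime)
g_backward = multivariate_normal(mean=theta_prime, cov=Sigma).pdf(theta)
print(np.isclose(g_forward, g_backward))  # True
```

If instead the mean were a non-identity function of $\theta$, or the covariance depended on $\theta$, this check would generally fail and the full MH ratio would be needed.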
This is a single-step (independent) proposal, namely generating $(X^\star,Y^\star,Z^\star)$ simultaneously from the joint proposal with density
$$p(x^\star,y^\star,z^\star)= p(z^\star|\alpha,\beta)\,p(x^\star|z^\star,\boldsymbol{\gamma}_x)\,p(y^\star|z^\star,\boldsymbol{\gamma}_y).$$
Therefore the acceptance probability to move from $(x^-,y^-,z^-)$ to $(x^\star,y^\star,z^\star)$ in this independent Metropolis-Hastings algorithm is
$$1 \wedge \dfrac{\pi(x^\star,y^\star,z^\star)}{\pi(x^-,y^-,z^-)}\times \dfrac{p(x^-,y^-,z^-)}{p(x^\star,y^\star,z^\star)}.$$
When $$\alpha = \min \left( 1,\frac{f(y|\theta^{'})f(\theta^{'}) q(\theta|\theta^{'})}{f(y|\theta)f(\theta) q(\theta^{'}|\theta)} \right)$$ involves an intractable likelihood function $f(y|\cdot)$ that cannot be computed, several (exact) alternatives are available:
1. The intractable part of $f(y|\theta)$ may also appear in $q(\theta|\theta')$ and hence cancel in the ratio. This is the idea behind the auxiliary variable device of Møller et al. (2006), also pursued by Murray et al. (2012). These papers mostly address the setup of doubly intractable distributions, where the likelihood function $f(y|\theta)$ involves a multiplicative factor $\mathfrak c(\theta)$ that is itself intractable.
2. The intractable likelihood $f(y|\theta)$ may be estimated unbiasedly by a random variable $\xi(y,\theta)$, even up to a normalising constant: $$\mathbb E[\xi(y,\theta)]=\alpha(y)f(y|\theta),$$ where $\alpha(y)$ may be unknown or intractable. This is the idea of the pseudo-marginal MCMC of Andrieu & Roberts (2009).
3. Demarginalising $y$ into $(y,z)$ and $f(y|\theta)$ into $\tilde f(y,z|\theta)$ such that $$\int_{\mathbb Z} \tilde f(y,z|\theta)\,\text dz=f(y|\theta),$$ with $\tilde f(y,z|\theta)$ tractable, is a more general auxiliary variable method, where the augmented $(\theta,z)$ is simulated conditional on $y$ through an MCMC method. When using a Gibbs sampler, the ratio $\alpha$ may then be replaced at iteration $t$ by $$\tilde\alpha = \min \left( 1,\frac{\tilde f(y,z^t|\theta')f(\theta') q(\theta^t|\theta')}{\tilde f(y,z^t|\theta^t)f(\theta^t) q(\theta'|\theta^t)} \right),$$ equivalent to $$\tilde\alpha = \min \left( 1,\frac{\tilde f(y|\theta',z^t)f(\theta') q(\theta^t|\theta')}{\tilde f(y|\theta^t,z^t)f(\theta^t) q(\theta'|\theta^t)} \right),$$ which is a special case of 1.
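The pseudo-marginal approach of Andrieu & Roberts mentioned above can be sketched as follows. This is a toy illustration, not their implementation: the unbiased estimator here is a hypothetical one (a known Gaussian likelihood corrupted by mean-one log-normal noise) chosen only so the result can be checked against the exact posterior; the key mechanics are that the estimate for the current state is stored and never refreshed.

```python
import numpy as np

def pseudo_marginal_mh(log_prior, unbiased_lik, theta0, n_iter, scale, rng):
    """Pseudo-marginal MH: the intractable likelihood is replaced by a
    non-negative unbiased estimate. The estimate attached to the current
    state is kept between iterations (refreshing it would break exactness)."""
    theta, lik_hat = theta0, unbiased_lik(theta0, rng)
    samples = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + rng.normal(0.0, scale)
        lik_prop = unbiased_lik(prop, rng)
        log_A = (np.log(lik_prop) + log_prior(prop)
                 - np.log(lik_hat) - log_prior(theta))
        if np.log(rng.uniform()) < min(log_A, 0.0):
            theta, lik_hat = prop, lik_prop
        samples[t] = theta
    return samples

# Toy check: N(y; theta, 1) likelihood times mean-one log-normal noise
# (E[exp(N(-s^2/2, s^2))] = 1, so the estimator is unbiased), flat prior.
y, s = 1.5, 0.3
unbiased_lik = lambda th, rng: (np.exp(-0.5 * (y - th) ** 2)
                                * np.exp(rng.normal(-0.5 * s ** 2, s)))
rng = np.random.default_rng(0)
draws = pseudo_marginal_mh(lambda th: 0.0, unbiased_lik, 0.0, 20_000, 1.0, rng)
```

Remarkably, the chain still targets the exact posterior (here $\text{N}(1.5, 1)$), not an approximation, despite never evaluating the likelihood exactly.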
If none of these (related) approaches can be used (in a sufficiently efficient manner), then an approximate alternative is to resort to ABC (approximate Bayesian computation).