Your description of the algorithm suggests some misconceptions about what the Metropolis–Hastings (MH) algorithm actually is.
First of all, one has to understand that MH is a sampling algorithm. As stated on Wikipedia:

> In statistics and in statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult.
In order to implement the MH algorithm you need a proposal density or jumping distribution $Q(\cdot\vert\cdot)$, from which it is easy to sample. If you want to sample from a distribution $f(\cdot)$, the MH algorithm can be implemented as follows:
- Pick an initial state $x_0$.
- At step $t$, generate a candidate $x^{\star}$ from $Q(\cdot\vert x_t)$.
- Calculate the acceptance probability $\alpha=\min\left(1,\dfrac{f(x^{\star})\,Q(x_t\vert x^{\star})}{f(x_t)\,Q(x^{\star}\vert x_t)}\right)$; when $Q$ is symmetric, this reduces to $\min\bigl(1, f(x^{\star})/f(x_t)\bigr)$.
- With probability $\alpha$, accept the candidate and set $x_{t+1}=x^{\star}$; otherwise set $x_{t+1}=x_t$.
- Repeat until you have the desired sample size.
Once you have the sample you still need to burn it and thin it: because the chain only converges to $f$ asymptotically, you discard the first $N$ draws (burn-in), and because consecutive draws are dependent, you keep only every $k$-th iteration (thinning).
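The steps above, together with burn-in and thinning, can be sketched in Python. This is a minimal illustration, not production code: it assumes a symmetric Gaussian proposal, so the acceptance ratio reduces to $f(x^{\star})/f(x_t)$, and uses a standard normal target chosen purely for demonstration:

```python
import numpy as np

def metropolis_hastings(log_f, x0, n_samples, proposal_sd=1.0,
                        burn_in=1000, thin=10, rng=None):
    """Random-walk Metropolis sampler for a 1-D target density f.

    log_f       : log of the (possibly unnormalised) target density
    x0          : initial state
    proposal_sd : std. dev. of the symmetric Gaussian proposal Q(.|x)
    burn_in     : number of initial iterations to discard
    thin        : keep every `thin`-th draw to reduce autocorrelation
    """
    rng = rng or np.random.default_rng(0)
    x = x0
    kept = []
    total = burn_in + n_samples * thin
    for i in range(total):
        x_star = rng.normal(x, proposal_sd)   # candidate from Q(.|x)
        # Accept with probability min(1, f(x*)/f(x)), done on the log scale;
        # the min(1, .) is implicit because log(u) <= 0.
        if np.log(rng.uniform()) < log_f(x_star) - log_f(x):
            x = x_star                        # accept the candidate
        # otherwise the chain stays at the current state x
        if i >= burn_in and (i - burn_in) % thin == 0:
            kept.append(x)
    return np.array(kept)

# Illustrative target: standard normal, via its log density up to a constant.
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=2000)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```

Note that rejected proposals still contribute a (repeated) draw: keeping the current state on rejection is what makes the accepted states a sample from $f$.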
An example in R can be found in the following link:
http://www.mas.ncl.ac.uk/~ndjw1/teaching/sim/metrop/metrop.html
This method is widely used in Bayesian statistics to sample from the posterior distribution of model parameters.
The example you are using seems unclear to me, given that $f(x)=ax$ is not a density unless you restrict $x$ to a bounded set. My impression is that you are interested in fitting a straight line to a set of points, for which I would recommend looking at the Metropolis-Hastings algorithm in the context of linear regression. The following reference presents some ideas on how MH can be used in this context (Example 6.8):
Robert & Casella (2010), Introducing Monte Carlo Methods with R, Ch. 6, "Metropolis–Hastings Algorithms"
There are also lots of questions on this site, with pointers to interesting references, discussing the meaning of the likelihood function.
Another pointer of possible interest is the R package mcmc, which implements the MH algorithm with Gaussian proposals in the function metrop().
1) You could think of this method as a random-walk approach. When the proposal distribution is $x \mid x^t \sim N( x^t, \sigma^2)$, it is commonly referred to as the Metropolis algorithm. If $\sigma^2$ is too small, you will have a high acceptance rate but explore the target distribution very slowly. In fact, if $\sigma^2$ is too small and the distribution is multi-modal, the sampler may get stuck in a particular mode and never fully explore the target distribution. On the other hand, if $\sigma^2$ is too large, the acceptance rate will be too low. Since you have three dimensions, your proposal distribution would have a covariance matrix $\Sigma$, which will likely require different variances and covariances for each dimension. Choosing an appropriate $\Sigma$ can be difficult.
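The tuning trade-off can be seen concretely in a small Python experiment (an illustrative setup, using a standard normal target of my own choosing) comparing the acceptance rates of random-walk Metropolis for a very small and a very large proposal standard deviation:

```python
import numpy as np

def acceptance_rate(proposal_sd, n_iter=20000, seed=1):
    """Fraction of accepted moves for random-walk Metropolis
    on a standard normal target (illustrative choice)."""
    rng = np.random.default_rng(seed)
    log_f = lambda z: -0.5 * z**2   # standard normal, up to a constant
    x, accepted = 0.0, 0
    for _ in range(n_iter):
        x_star = rng.normal(x, proposal_sd)
        if np.log(rng.uniform()) < log_f(x_star) - log_f(x):
            x, accepted = x_star, accepted + 1
    return accepted / n_iter

print(acceptance_rate(0.1))   # small sigma: very high acceptance, tiny steps
print(acceptance_rate(25.0))  # large sigma: most proposals rejected
```

Neither extreme explores the target efficiently; in practice one tunes $\sigma$ (or $\Sigma$) toward a moderate acceptance rate between these two regimes.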
2) If your proposal distribution is always $N(\mu, \sigma^2)$, then this is the independent Metropolis-Hastings algorithm since your proposal distribution does not depend on your current sample. This method works best if your proposal distribution is a good approximation of the target distribution you wish to sample from. You are correct that choosing a good normal approximation can be difficult.
Neither method's success should depend on the starting value of the sampler. No matter where you start, the Markov chain will eventually converge to the target distribution. To check convergence, you can run several chains from different starting points and compute a diagnostic such as the Gelman-Rubin statistic.
Best Answer
I don't have a great example off the top of my head, but MH becomes attractive relative to direct sampling whenever a parameter's prior is not conjugate to its likelihood; in fact, this is the only reason I have ever seen MH preferred. A toy example: suppose $p \sim \text{Beta}(\alpha, \beta)$ and you want independent priors $\alpha, \beta \sim \text{Gamma}()$. This model is not conjugate, so you would need to use MH to draw $\alpha$ and $\beta$.
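To make the non-conjugate case concrete, here is a sketch in Python of a single MH update for $\alpha$ in that toy model, holding $p$ and $\beta$ fixed. The Gamma hyperparameters ($a_0, b_0$) and the log-scale random-walk proposal are my own illustrative assumptions, since the original leaves the Gamma prior unspecified:

```python
import math
import random

def log_post_alpha(alpha, p, b, a0=2.0, b0=1.0):
    """Log conditional posterior of alpha (up to a constant), given
    p ~ Beta(alpha, b) and prior alpha ~ Gamma(a0, b0).
    a0, b0 are illustrative hyperparameter choices."""
    if alpha <= 0:
        return -math.inf
    # log Beta function B(alpha, b) via lgamma
    log_beta_fn = math.lgamma(alpha) + math.lgamma(b) - math.lgamma(alpha + b)
    log_lik = (alpha - 1) * math.log(p) - log_beta_fn   # alpha-dependent terms
    log_prior = (a0 - 1) * math.log(alpha) - b0 * alpha
    return log_lik + log_prior

def mh_update_alpha(alpha, p, b, step=0.5, rng=random):
    """One random-walk MH step for alpha on the log scale; the log-normal
    proposal requires a Jacobian correction in the acceptance ratio."""
    alpha_star = math.exp(math.log(alpha) + rng.gauss(0.0, step))
    log_ratio = (log_post_alpha(alpha_star, p, b) - log_post_alpha(alpha, p, b)
                 + math.log(alpha_star) - math.log(alpha))  # Jacobian term
    if math.log(rng.random()) < log_ratio:
        return alpha_star   # accept
    return alpha            # reject: keep current value

random.seed(0)
alpha, p, b = 1.0, 0.3, 2.0
draws = []
for _ in range(5000):
    alpha = mh_update_alpha(alpha, p, b)
    draws.append(alpha)
# draws now approximate the conditional posterior of alpha given p and beta
```

In a full sampler this step would alternate with an analogous MH step for $\beta$ and a conjugate draw for $p$, i.e. MH within Gibbs.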
This presentation gives an example of a Poisson GLM which uses MH for drawing the GLM coefficients.
If you don't already know, it might be worth noting that direct sampling is the special case of MH in which every drawn value is accepted. So whenever we can sample directly we should, to avoid having to tune a proposal distribution.