There is no reason to state that MCMC sampling is the "best" Monte Carlo method! On the contrary, it is usually worse than iid sampling, at least in terms of the variance of the resulting Monte Carlo estimators$$\frac{1}{T}\sum_{t=1}^T h(X_t).$$Indeed, while this average converges to the expectation $\mathbb{E}_{\pi}[h(X)]$ when $\pi$ is the stationary and limiting distribution of the Markov chain $(X_t)_t$, there are at least two drawbacks to using MCMC methods:
- The chain needs to "reach stationarity", meaning that it needs to forget about its starting value $X_0$. In other words, $t$ must be "large enough" for $X_t$ to be distributed from $\pi$. Sometimes "large enough" may exceed the computing budget for the experiment by several orders of magnitude.
- The values $X_t$ are correlated, leading to an asymptotic variance that involves
$$\text{var}_\pi(h(X))+2\sum_{t=1}^\infty\text{cov}_\pi(h(X_0),h(X_t)),$$ which generally exceeds $\text{var}_\pi(h(X))$ and hence calls for longer simulations than an iid sample would.
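The inflation of the asymptotic variance by the covariance terms can be checked numerically. Here is a small Python sketch (my own illustration, not part of the original answer) using a stationary AR(1) chain as a stand-in for MCMC output, with $h$ the identity: for autocorrelation $\rho$, the covariance sum gives an asymptotic variance of $\sigma^2(1+\rho)/(1-\rho)$ instead of the iid value $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T, R = 0.9, 5000, 2000        # autocorrelation, chain length, replicates

# Stationary AR(1) chains X_t = rho*X_{t-1} + eps_t with eps_t ~ N(0,1),
# used as a stand-in for correlated MCMC output (h is the identity here).
sigma2 = 1.0 / (1.0 - rho**2)      # stationary variance var_pi(X)
x = rng.normal(0.0, np.sqrt(sigma2), size=R)   # start at stationarity
total = np.zeros(R)
for _ in range(T):
    x = rho * x + rng.normal(size=R)
    total += x
means = total / T

# T * var(ergodic average) approaches var + 2 * sum of covariances,
# i.e. sigma2*(1+rho)/(1-rho), well above the iid value sigma2.
asym_var = T * means.var()
print(round(asym_var, 1), round(sigma2 * (1 + rho) / (1 - rho), 1))
```

With $\rho=0.9$ the correlated chain needs roughly $19$ times as many draws as an iid sampler for the same precision.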
This being said, MCMC is very useful for handling settings where regular iid sampling is impossible or too costly and where importance sampling is quite difficult to calibrate, in particular because of the dimension of the random variable to be simulated.
However, sequential Monte Carlo methods like particle filters may be more appropriate in dynamical models, where the data comes by bursts that need immediate attention and may even vanish (i.e., cannot be stored) after a short while.
In conclusion, MCMC is a very useful (and very much used) tool to handle complex settings where regular Monte Carlo solutions fail.
Yes. Unlike what other answers state, 'typical' machine-learning methods such as nonparametrics and (deep) neural networks can help create better MCMC samplers.
The goal of MCMC is to draw samples from an (unnormalized) target distribution $f(x)$. The obtained samples are used to approximate $f$ and mainly serve to compute expectations of functions under $f$ (i.e., high-dimensional integrals) and, in particular, properties of $f$ (such as moments).
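As a minimal illustration of this setup (my own sketch, not from any of the cited papers), here is a random-walk Metropolis sampler in Python: it only ever evaluates the unnormalized $f$, and the draws are averaged to estimate a moment. The target, step size, and chain length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_f(x):
    # Unnormalized log target: a standard normal up to an additive constant.
    return -0.5 * x * x

# Random-walk Metropolis: propose x + N(0, step^2), accept w.p. min(1, f(y)/f(x)).
x, step, n = 0.0, 1.5, 200_000
samples = np.empty(n)
for i in range(n):
    y = x + step * rng.normal()
    if np.log(rng.random()) < log_f(y) - log_f(x):
        x = y
    samples[i] = x

# The draws approximate f, so averages approximate expectations under f:
est_second_moment = (samples**2).mean()   # E[X^2] = 1 for this target
print(round(est_second_moment, 2))
```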
Sampling usually requires a large number of evaluations of $f$, and possibly of its gradient, for methods such as Hamiltonian Monte Carlo (HMC).
If $f$ is costly to evaluate, or the gradient is unavailable, it is sometimes possible to build a less expensive surrogate function that can help guide the sampling and is evaluated in place of $f$ (in a way that still preserves the properties of MCMC).
For example, a seminal paper (Rasmussen 2003) proposes to use Gaussian processes (a nonparametric function approximation) to build an approximation of $\log f$ and perform HMC on the surrogate function, with only the acceptance/rejection step of HMC based on $f$. This reduces the number of evaluations of the original $f$ and makes it possible to run MCMC on pdfs that would otherwise be too expensive to evaluate.
The idea of using surrogates to speed up MCMC has been explored a lot in the past few years, essentially by trying different ways to build the surrogate function and combine it efficiently/adaptively with different MCMC methods (and in a way that preserves the 'correctness' of MCMC sampling). Related to your question, these two very recent papers use advanced machine learning techniques -- random networks (Zhang et al. 2015) or adaptively learnt exponential kernel functions (Strathmann et al. 2015) -- to build the surrogate function.
HMC is not the only form of MCMC that can benefit from surrogates. For example, Nishihara et al. (2014) build an approximation of the target density by fitting a multivariate Student's $t$ distribution to the multi-chain state of an ensemble sampler, and use this to perform a generalized form of elliptical slice sampling.
These are only examples. In general, a number of distinct ML techniques (mostly in the area of function approximation and density estimation) can be used to extract information that might improve the efficiency of MCMC samplers. Their actual usefulness -- e.g. measured in number of "effective independent samples per second" -- is conditional on $f$ being expensive or somewhat hard to compute; also, many of these methods may require tuning of their own or additional knowledge, restricting their applicability.
References:
Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals." Bayesian Statistics 7. 2003.
Zhang, Cheng, Babak Shahbaba, and Hongkai Zhao. "Hamiltonian Monte Carlo Acceleration using Surrogate Functions with Random Bases." arXiv preprint arXiv:1506.05555 (2015).
Strathmann, Heiko, et al. "Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families." Advances in Neural Information Processing Systems. 2015.
Nishihara, Robert, Iain Murray, and Ryan P. Adams. "Parallel MCMC with generalized elliptical slice sampling." Journal of Machine Learning Research 15.1 (2014): 2087-2112.
Best Answer
The ideal Monte Carlo algorithm uses independent successive random values. In MCMC, successive values are not independent, which makes the method converge more slowly than ideal Monte Carlo; however, the faster it mixes, the faster the dependence decays over successive iterations¹, and the faster it converges.
¹ I mean here that the successive values are quickly "almost independent" of the initial state, or rather that given the value $X_n$ at one point, the values $X_{n+k}$ become quickly "almost independent" of $X_n$ as $k$ grows; so, as qkhhly says in the comments, "the chain don’t keep stuck in a certain region of the state space".
Edit: I think the following example can help
Imagine you want to estimate the mean of the uniform distribution on $\{1, \dots, n\}$ by MCMC. You start with the ordered sequence $(1, \dots, n)$; at each step, you choose $k \ge 2$ elements of the sequence and shuffle them randomly. At each step, the element at position 1 is recorded; its distribution converges to the uniform distribution. The value of $k$ controls the mixing speed: when $k=2$, mixing is slow; when $k=n$, successive elements are independent and mixing is fast.
This sampler takes only a few lines of code to implement.
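The original answer gave the sampler as an R function; here is an equivalent Python sketch (the name `shuffle_mcmc` and its signature are my choices):

```python
import numpy as np

def shuffle_mcmc(n, k, n_iter, rng):
    """Record the first element of (1..n) while repeatedly shuffling k
    randomly chosen positions; the records converge to Uniform{1..n}."""
    state = np.arange(1, n + 1)          # start from the ordered sequence
    record = np.empty(n_iter)
    for t in range(n_iter):
        idx = rng.choice(n, size=k, replace=False)  # pick k positions
        state[idx] = state[rng.permutation(idx)]    # shuffle those positions
        record[t] = state[0]             # record the element at position 1
    return record
```

Averaging the recorded values gives the MCMC estimate of the mean.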
Let’s apply it for $n = 99$, and plot the successive estimation of the mean $\mu = 50$ along the MCMC iterations:
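Since the original code and plot are not shown, here is a self-contained Python sketch that computes the running estimates for three values of $k$ (chain length and seed are arbitrary choices); plotting the three `running` curves reproduces the comparison:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_iter = 99, 2000

# Running estimate of the mean (true value mu = 50) for several shuffle sizes k.
running = {}
for k in (2, 50, 99):
    state = np.arange(1, n + 1)          # ordered starting sequence
    rec = np.empty(n_iter)
    for t in range(n_iter):
        idx = rng.choice(n, size=k, replace=False)
        state[idx] = state[rng.permutation(idx)]   # shuffle the k chosen positions
        rec[t] = state[0]                # record the element at position 1
    running[k] = np.cumsum(rec) / np.arange(1, n_iter + 1)

for k in (2, 50, 99):
    print(k, round(running[k][-1], 1))   # estimate after 2000 iterations
```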
You can see that for $k=2$, the convergence is slow; for $k=50$, it is faster, but still slower than with $k=99$.
You can also plot a histogram of the distribution of the estimated mean after a fixed number of iterations, e.g. 100 iterations:
You can see that with $k=2$, the influence of the initial value after only 100 iterations gives a terrible result. With $k=50$ it seems fine, though still with a greater standard deviation than with $k=99$. Here are the means and sds:
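The original figures and numbers are not shown; the following self-contained Python sketch recomputes the mean and standard deviation of the estimator over repeated 100-iteration runs (the number of replications is my choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_iter, n_rep = 99, 100, 500

# Distribution of the estimated mean after 100 iterations, over 500 replications.
results = {}
for k in (2, 50, 99):
    est = np.empty(n_rep)
    for r in range(n_rep):
        state = np.arange(1, n + 1)      # restart from the ordered sequence
        rec = np.empty(n_iter)
        for t in range(n_iter):
            idx = rng.choice(n, size=k, replace=False)
            state[idx] = state[rng.permutation(idx)]
            rec[t] = state[0]
        est[r] = rec.mean()              # one 100-iteration estimate of the mean
    results[k] = (est.mean(), est.std())
    print(k, round(est.mean(), 1), round(est.std(), 1))
```

With $k=2$ the estimates are strongly biased toward the starting value; with $k=99$ they are centered at 50 with the smallest spread.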