Solved – Metropolis-Hastings integration – why isn’t the strategy working

metropolis-hastings, monte-carlo, numerical-integration, simulation

Assume I have a function $g(x)$ that I want to integrate
$$ \int_{-\infty}^\infty g(x) dx.$$
Of course I am assuming $g(x)$ goes to zero at the endpoints and has no blowups, i.e. it is a nice function. One approach I've been fiddling with is to use the Metropolis-Hastings algorithm to generate a list of samples $x_1, x_2, \dots, x_n$ from the distribution proportional to $g(x)$ (which I will call $p(x)$), i.e. $g(x)$ without the normalization constant
$$N = \int_{-\infty}^{\infty} g(x)\,dx, $$
and then to compute some statistic $f(x)$ on these $x$'s:
$$ \frac{1}{n} \sum_{i=1}^n f(x_i) \approx \int_{-\infty}^\infty f(x)\,p(x)\,dx.$$

Since $p(x) = g(x)/N$, I can substitute in $f(x) = U(x)/g(x)$ to cancel $g$ from the integral, resulting in an expression of the form
$$ \frac{1}{N}\int_{-\infty}^{\infty}\frac{U(x)}{g(x)} g(x) dx = \frac{1}{N}\int_{-\infty}^\infty U(x) dx.$$
So provided that $U(x)$ integrates to $1$ over that region, I should get the result $1/N$, and I can just take the reciprocal to get the answer I want. To make the most effective use of the points, I could take the range of my sample, $r = x_{\max} - x_{\min}$, and let $U(x) = 1/r$ on that range. That way $U(x)$ evaluates to zero outside of the region where my samples are, but integrates to $1$ over that region. So if I now take the expected value, I should get:
$$E\left[\frac{U(x)}{g(x)}\right] = \frac{1}{N} \approx \frac{1}{n} \sum_{i=1}^n \frac{U(x_i)}{g(x_i)}. $$

I tried testing this in R with the example function $g(x) = e^{-x^2}$. In this case I do not use Metropolis-Hastings to generate the samples, but instead draw them from the actual normalized density with rnorm (just to test). I do not quite get the results I am looking for. The full expression of what I'd be calculating is:
$$\frac{1}{n(x_{\max} - x_{\min})} \sum_{i=1}^n \frac{1}{ e^{-x_i^2}}. $$
In my theory this should evaluate to $1/\sqrt{\pi}$. It gets close, but it certainly does not converge in the expected way. Am I doing something wrong?

ys = rnorm(1000000, 0, 1/sqrt(2))                           # samples from p(x) = exp(-x^2)/sqrt(pi)
r = max(ys) - min(ys)                                       # range of the sample
sum(sapply(ys, function(x) 1/( r * exp(-x^2))))/length(ys)  # average of U(x_i)/g(x_i)
## evaluates to 0.6019741. 1/sqrt(pi) = 0.5641896
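
For reference, here is a minimal sketch of what the Metropolis-Hastings version of the same experiment could look like; the random-walk proposal scale, chain length, and starting point are arbitrary choices, not something specified above:

g = function(x) exp(-x^2)
n = 1e5
xs = numeric(n)                                 # chain starts at 0
for (i in 2:n) {
  prop = xs[i-1] + rnorm(1)                     # symmetric random-walk proposal
  xs[i] = if (runif(1) < g(prop)/g(xs[i-1])) prop else xs[i-1]  # Metropolis acceptance
}
r = max(xs) - min(xs)
mean(1/(r * g(xs)))                             # range-based estimate of 1/sqrt(pi)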

Edit for CliffAB

The reason I use the range is just to easily define a function that is non-zero over the region where my points are, but that integrates to $1$ over $(-\infty, \infty)$. The full specification of the function is:
$$ U(x) = \begin{cases} \frac{1}{x_{\max} - x_{\min}} & x_{\min} < x < x_{\max} \\ 0 & \text{otherwise.} \end{cases} $$
I did not have to use this uniform density for $U(x)$. I could have used some other density that integrates to $1$, for example the probability density
$$ P(x) = \frac{1}{\sqrt{\pi}} e^{-x^2}.$$
However, this would have made the sum over the individual samples trivial, i.e.
$$ \frac{1}{n} \sum_{i=1}^n \frac{P(x_i)}{g(x_i)} = \frac{1}{n} \sum_{i=1}^n \frac{e^{-x_i^2}/\sqrt{\pi}}{e^{-x_i^2} } = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{\pi}} = \frac{1}{\sqrt{\pi}}.$$

I could try this technique for other distributions that integrate to $1$. However, I would still like to know why it doesn't work for a uniform distribution.

Best Answer

This is a most interesting question, which relates to the issue of approximating the normalising constant of a density $g$ based on an MCMC output from that same density $g$. (A side remark is that the correct assumption to make is that $g$ is integrable; going to zero at infinity is not sufficient.)

In my opinion, the most relevant entry on this topic with regard to your suggestion is a paper by Gelfand and Dey (1994, JRSS B), where the authors develop a very similar approach to find
$$\int_\mathcal{X} g(x) \,\text{d}x$$
when generating from $p(x)\propto g(x)$. One result in this paper is that, for any probability density $\alpha(x)$ [this is equivalent to your $U(x)$] such that
$$\{x;\alpha(x)>0\}\subset\{x;g(x)>0\},$$
the identity
$$\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}$$
shows that a sample from $p$ can produce an unbiased evaluation of $1/N$ via the importance sampling estimator
$$\hat\eta=\frac{1}{n}\sum_{i=1}^n \dfrac{\alpha(x_i)}{g(x_i)},\qquad x_i\stackrel{\text{iid}}{\sim}p(x).$$

Obviously, the performance (convergence speed, existence of a variance, &tc.) of the estimator $\hat\eta$ depends on the choice of $\alpha$ [even though its expectation does not]. In a Bayesian framework, a choice advocated by Gelfand and Dey is to take $\alpha=\pi$, the prior density. Since $g(x)=\pi(x)\ell(x)$, where $\ell(x)$ is the likelihood function, this leads to
$$\dfrac{\alpha(x)}{g(x)} = \dfrac{1}{\ell(x)}.$$
Unfortunately, the resulting estimator
$$\hat{N}=\dfrac{n}{\sum_{i=1}^n1\big/\ell(x_i)}$$
is the harmonic mean estimator, also called the worst Monte Carlo estimator ever by Radford Neal, of the University of Toronto. So it does not always work out nicely. Or even hardly ever.
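
As a quick illustration of the above identity in the toy example (a minimal sketch; the concentrated density $\alpha=\mathcal{N}(0,0.5^2)$ is an arbitrary choice here, not the choice advocated by Gelfand and Dey):

ys = rnorm(1e6, 0, 1/sqrt(2))                   # iid sample from p(x) = exp(-x^2)/sqrt(pi)
alpha = dnorm(ys, 0, 0.5)                       # an arbitrary density concentrated where g is large
mean(alpha/exp(-ys^2))                          # estimate of 1/N, should be close to 1/sqrt(pi) = 0.5641896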

Your idea of using the range of your sample $(\min(x_i),\max(x_i))$ and the uniform density over that range is connected with the harmonic mean issue: this estimator does not have a variance, if only because of the $\exp\{x_i^2\}$ appearing in the numerator (I suspect this could always be the case for an unbounded support!), and it thus converges very slowly to the normalising constant. For instance, if you rerun your code several times, you get very different numerical values after $10^6$ iterations. This means you cannot even trust the magnitude of the answer.
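
One way to see this in practice (a small check, with an arbitrary replication count and a smaller sample size to keep it fast) is to wrap the range-based estimator in a function and replicate it:

range_est = function(n) {
  ys = rnorm(n, 0, 1/sqrt(2))
  r = max(ys) - min(ys)
  mean(1/(r * exp(-ys^2)))                      # range-based estimate of 1/sqrt(pi)
}
replicate(10, range_est(1e5))                   # values fluctuate wildly from run to run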

A generic fix to this infinite variance issue is to take for $\alpha$ a more concentrated density, for instance one based on the quartiles of your sample $(q_{.25}(x_i),q_{.75}(x_i))$, because $g$ then remains bounded away from zero over this interval.

When adapting your code to this new density, the approximation is much closer to $1/\sqrt{\pi}$:

ys = rnorm(1e6, 0, 1/sqrt(2))                               # samples from p(x) = exp(-x^2)/sqrt(pi)
r = quantile(ys,.75) - quantile(ys,.25)                     # interquartile range: alpha = 1/r on it
yc = ys[(ys>quantile(ys,.25))&(ys<quantile(ys,.75))]        # samples inside the interquartile range
sum(sapply(yc, function(x) 1/( r * exp(-x^2))))/length(ys)  # average of alpha(x_i)/g(x_i) over all n samples
## evaluates to 0.5649015. 1/sqrt(pi) = 0.5641896
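
For contrast with the earlier replication check, the quartile-based version can be wrapped the same way (again with an arbitrary replication count and sample size):

quartile_est = function(n) {
  ys = rnorm(n, 0, 1/sqrt(2))
  q = quantile(ys, c(.25, .75))
  r = q[2] - q[1]                               # alpha is uniform on the interquartile range
  yc = ys[ys > q[1] & ys < q[2]]
  sum(1/(r * exp(-yc^2)))/n                     # samples outside the range contribute zero
}
replicate(10, quartile_est(1e5))                # values stay close to 1/sqrt(pi) = 0.5641896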

We discuss this method in detail in two papers, one with Darren Wraith and one with Jean-Michel Marin.