Understanding importance sampling in Monte Carlo integration

bias-variance tradeoff, confidence interval, importance-sampling, monte carlo, simulation

Introduction

I'm studying importance sampling and I'm trying to figure out by myself, with a couple of examples, what the main benefits are with respect to standard Monte Carlo integration.
I'm not sure whether I have correctly understood the theory and whether my examples are correct.
Moreover, I've found an implementation problem which I'd like to discuss.

Background

Consider the following expectation of $f(x):\mathbb{R}^d\to \mathbb{R}$:
\begin{equation*}\bar{f}\triangleq \int f(x)\,p(x)\text{ d}x\end{equation*}
where $p(x):\mathbb{R}^d\to \mathbb{R}_{\geq 0}$ is a PDF and the integration domain is understood to be all of $\mathbb{R}^d$. Standard Monte Carlo tells us that we can estimate $\bar{f}$ with the following empirical average
\begin{equation*}\hat{\bar{f}}_M \triangleq \frac{1}{M} \sum_{i=1}^M f(x_i) \qquad x_i \sim p(x)\end{equation*}
where for all $i$ the particles $x_i$ are drawn from $p(x)$. On the other hand, importance sampling tells us that we can equivalently write the expectation $\bar{f}$ as
\begin{equation*}\bar{f} = \int \frac{p(x)}{q(x)}\,f(x)\, q(x)\text{ d}x\end{equation*}
where $q(x):\mathbb{R}^d\to \mathbb{R}_{\geq 0}$ is a PDF, referred to as the proposal, whose support includes the support of the original PDF $p(x)$. Then, we can apply standard Monte Carlo to this new integral and we get the estimate
\begin{equation*}
\hat{\bar{f}}_M \triangleq \frac{1}{M} \sum_{i=1}^M \frac{p(x_i)}{q(x_i)}\,f(x_i) \qquad x_i \sim q(x)
\end{equation*}
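To fix ideas, here is a minimal sketch of both estimators on a toy 1-D problem; the target, the proposal, and the use of numpy/scipy are my own illustrative choices, not part of the question's setup.

```python
# Toy comparison of standard MC and importance sampling (illustrative
# setup, not the question's): estimate E[X^2] = 1 for X ~ N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M = 10_000

f = lambda x: x**2
p = stats.norm(0.0, 1.0)   # target PDF
q = stats.t(df=3)          # proposal, heavier-tailed than the target

# Standard Monte Carlo: sample p itself and average f.
x_p = p.rvs(size=M, random_state=rng)
print(f(x_p).mean())

# Importance sampling: sample q and reweight each particle by p/q.
x_q = q.rvs(size=M, random_state=rng)
w = p.pdf(x_q) / q.pdf(x_q)
print((w * f(x_q)).mean())
```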

I understand that importance sampling provides two big advantages:

  1. if $p(x)$ is an "involved" PDF, then generating the samples $x_i\sim p(x)$ can be challenging. With importance sampling we sample $q(x)$ instead, which is a degree of freedom. Hence, we can choose how to generate the particles $x_i$ as we please and still obtain a good estimate of $\bar{f}$ (regardless of the fact that we are not sampling the underlying PDF $p(x)$ of the given problem);
  2. since $q(x)$ is a degree of freedom, we can choose it cleverly so as to minimize the variance of the estimate $\hat{\bar{f}}_M$ ($M$ being fixed). In particular, we know that the best proposal, which nullifies the variance of $\hat{\bar{f}}_M$, is
    \begin{equation*}q^*(x)\triangleq \frac{|f(x)|\,p(x)}{\int |f(x)|\,p(x)\text{ d}x}\end{equation*}
    however, we cannot sample $q^*(x)$ because: (1) it can be challenging to build a generator for $q^*(x)$; (2) by definition of the problem, we don't know the value of $\int f(x)\,p(x)\,\text{d}x$, so that – reasonably – even $\int |f(x)|\,p(x)\text{ d}x$ is unknown. Despite this fact, we can still choose $q(x)$ as an approximation of $q^*(x)$: we can't reach a null variance, but we can reduce it with respect to the one we would obtain with standard Monte Carlo (a one-line check of the zero-variance claim follows this list).
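Here is the promised one-line check, which strictly holds when $f\geq 0$ (for a sign-changing $f$, $q^*$ only minimizes the variance): substituting $q^*$ into the weighted integrand gives
\begin{equation*}
\frac{p(x_i)}{q^*(x_i)}\,f(x_i) = \frac{p(x_i)\,f(x_i)}{f(x_i)\,p(x_i)}\int f(x)\,p(x)\text{ d}x = \bar{f} \qquad \text{for every particle } x_i,
\end{equation*}
so each term of the empirical average equals $\bar{f}$ exactly and $\hat{\bar{f}}_M$ has zero variance.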

Example 1

With this example, I want to see "in action" the first advantage of importance sampling. Consider the following setup with $x=[\xi\,\,\eta]'\in \mathbb{R}^2$ ($'$ is the transpose operator)
\begin{equation*}\begin{aligned}
f(x)&\triangleq \xi\,\eta\\
p(x)&\triangleq [(\lambda_2+r\,\xi)(\lambda_1+r\,\eta)-r]\exp\left[-(\lambda_1\,\xi+\lambda_2\,\eta+r\,\xi\,\eta)\right]\,\mathbb{1}_{\xi,\eta \geq 0}(\xi,\eta)
\end{aligned}
\end{equation*}

where

  • $\lambda_1,\lambda_2 >0$ and $0\leq r \leq 1$ are given parameters. In this example I consider $\lambda_1\triangleq 1$, $\lambda_2 \triangleq 2$, $r\triangleq 1/2$.
  • $\mathbb{1}_{\xi,\eta \geq 0}(\xi,\eta)$ is the indicator function of the positive quadrant $\{\xi \geq 0,\ \eta \geq 0\}$.

The PDF that I'm considering here is a bivariate generalization of the exponential PDF, which I have found in this article. It is pretty clear that sampling $p(x)$ is not a trivial task, so we are in a good situation to show in practice the first point of importance sampling. The easiest thing to do is to replace $p(x)$ with a uniform proposal
\begin{equation*}q(x)\triangleq \frac{1}{\textrm{Vol}(S)} \mathbb{1}_S(x)\end{equation*}
where $S$ is the support of $q(x)$ and $\textrm{Vol}(S)$ is its Lebesgue measure. In principle, we should set $S$ so that it includes the support of $p(x)$, which in this case is the whole positive quadrant $\{\xi\geq 0,\ \eta\geq 0\}$; since that is unbounded, I truncate it. Based on the fact that $p(x)\to 0$ as $\xi$ or $\eta$ tends to $\infty$, I think a reasonable choice of $S$ is
\begin{equation*}
S\triangleq [0,\xi_{\max}]\times [0,\eta_{\max}]
\end{equation*}

where, for a desired threshold $\tau>0$, the extrema of $S$ are given by
\begin{equation*}
\xi_{\max}\triangleq \frac{1}{\lambda_1} \ln\left(\frac{\lambda_1}{\tau}\right) \qquad
\eta_{\max}\triangleq \frac{1}{\lambda_2} \ln\left(\frac{\lambda_2}{\tau}\right)
\end{equation*}

In particular, I consider $\tau\triangleq 10^{-3}$, and so, rounding the results, I take $\xi_{\max}\triangleq 7$, $\eta_{\max}\triangleq 4$. At this point all the ingredients to run a simulation are defined; a minimal implementation sketch follows, and the results for different values of $M$ are shown after it.
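A minimal sketch of this uniform-proposal estimator, assuming numpy; the function and variable names are mine.

```python
# Example 1: importance sampling with a uniform proposal on S.
import numpy as np

lam1, lam2, r = 1.0, 2.0, 0.5
xi_max, eta_max = 7.0, 4.0
vol_S = xi_max * eta_max

def f(xi, eta):
    return xi * eta

def p(xi, eta):
    # bivariate exponential PDF of the question (support: xi, eta >= 0)
    return ((lam2 + r * xi) * (lam1 + r * eta) - r) \
        * np.exp(-(lam1 * xi + lam2 * eta + r * xi * eta))

def is_estimate_uniform(M, rng):
    # particles drawn uniformly on S = [0, xi_max] x [0, eta_max]
    xi = rng.uniform(0.0, xi_max, size=M)
    eta = rng.uniform(0.0, eta_max, size=M)
    w = p(xi, eta) * vol_S          # importance weights p/q with q = 1/Vol(S)
    return np.mean(w * f(xi, eta))

rng = np.random.default_rng(0)
print(is_estimate_uniform(10_000, rng))
```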

[Figure: Gaussian fits of the $N_{\text{MC}}$ replicated estimates $\hat{\bar{f}}_M$ obtained with the uniform proposal, for several values of $M$.]

The Gaussian curves are obtained by generating $N_{\text{MC}}\triangleq 100$ estimates $\hat{\bar{f}}_M^{(i)}$, each with sample size $M$. More precisely, the center and variance of the Gaussians are
\begin{equation*}
\hat{\mu} \triangleq \frac{1}{N_{\text{MC}}} \sum_{i=1}^{N_{\text{MC}}} \hat{\bar{f}}_M^{(i)}
\qquad
\hat{\sigma}^2 \triangleq \frac{1}{N_{\text{MC}}-1} \sum_{i=1}^{N_{\text{MC}}} (\hat{\bar{f}}_M^{(i)}-\hat{\mu})^2
\end{equation*}

and the confidence intervals are defined as $[\hat{\mu}-3\,\hat{\sigma},\,\hat{\mu}+3\,\hat{\sigma}]$.
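The replication loop behind these numbers might look as follows (a sketch reusing is_estimate_uniform from the previous snippet).

```python
# Replicate the estimator N_MC times and summarize the spread.
import numpy as np

rng = np.random.default_rng(1)
N_MC, M = 100, 10_000
ests = np.array([is_estimate_uniform(M, rng) for _ in range(N_MC)])

mu_hat = ests.mean()
sigma_hat = ests.std(ddof=1)   # 1/(N_MC - 1) normalization, as above
print(mu_hat - 3 * sigma_hat, mu_hat + 3 * sigma_hat)
```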
Since $S$ does not cover the entire support of $p(x)$, these estimates are biased. However, the bias can be controlled: in particular, it can be reduced by choosing smaller values of $\tau$.

Example 2

With this example, I want to see "in action" the second advantage of importance sampling.
The setup is the same as in Example 1, but now I look for a "more clever" proposal $q(x)$, with the objective of reducing the variance $\hat{\sigma}^2$ of the estimates obtained in Example 1. The idea is to find a simple approximation of $q^*(x)$.

First, by neglecting $r$, the underlying PDF simplifies to the product of two exponential PDFs with rates $\lambda_1,\lambda_2$. Then, the product of these exponentials with $f(x)$ gives rise to a product of two Gamma PDFs with shape parameter $\alpha \triangleq 2$ and rates $\lambda_1, \lambda_2$. More formally, I consider the following approximation
\begin{equation*}\begin{aligned}
f(x)\,p(x) & \approx \lambda_1\,\xi\, \exp(-\lambda_1\,\xi)\mathbb{1}_{\xi \geq 0}(\xi)\,
\lambda_2\,\eta\, \exp(-\lambda_2\,\eta)\mathbb{1}_{\eta \geq 0}(\eta)\\
&= \frac{1}{\lambda_1\,\lambda_2}\,\Gamma(\xi; 2, \lambda_1)\,\Gamma(\eta; 2, \lambda_2)
\end{aligned}\end{equation*}

Now, from this approximation the proposal follows immediately:
\begin{equation*}q(x)\triangleq \Gamma(\xi; 2, \lambda_1)\,\Gamma(\eta; 2, \lambda_2)\end{equation*}
but, in order to make the simulation as simple as possible, I don't want to sample Gamma distributions. Instead, I replace them with Gaussians by moment matching, i.e. I end up with the following proposal
\begin{equation*}q(x)\triangleq\mathcal{N}\left(\xi; \frac{2}{\lambda_1}, \frac{2}{\lambda_1^2}\right)\,\mathcal{N}\left(\eta; \frac{2}{\lambda_2}, \frac{2}{\lambda_2^2}\right)\end{equation*}
which, in some sense, tries to mimic $q^*(x)$. Repeating the same experiment as in Example 1 gives the following results.
[Figure: Gaussian fits of the replicated estimates obtained with the moment-matched Gaussian proposal, for the same values of $M$.]
The new results should be unbiased, because the support of the current proposal entirely covers the support of $p(x)$; hence, the new Gaussians need not share the same centers as the ones obtained in the previous example.
As expected, it seems that the new proposal is able to generate estimates with smaller variance. (A sketch of this estimator follows.)
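A minimal sketch of this Gaussian-proposal estimator, reusing f, p, lam1, lam2 from the Example 1 snippet.

```python
# Example 2: importance sampling with the moment-matched Gaussian proposal.
import numpy as np

def q_gauss(xi, eta):
    # product of the two moment-matched normal densities
    var1, var2 = 2.0 / lam1**2, 2.0 / lam2**2
    n1 = np.exp(-(xi - 2.0 / lam1)**2 / (2 * var1)) / np.sqrt(2 * np.pi * var1)
    n2 = np.exp(-(eta - 2.0 / lam2)**2 / (2 * var2)) / np.sqrt(2 * np.pi * var2)
    return n1 * n2

def is_estimate_gauss(M, rng):
    xi = rng.normal(2.0 / lam1, np.sqrt(2.0) / lam1, size=M)
    eta = rng.normal(2.0 / lam2, np.sqrt(2.0) / lam2, size=M)
    w = np.zeros(M)
    inside = (xi >= 0) & (eta >= 0)   # p vanishes outside the quadrant
    w[inside] = p(xi[inside], eta[inside]) / q_gauss(xi[inside], eta[inside])
    return np.mean(w * f(xi, eta))

rng = np.random.default_rng(2)
print(is_estimate_gauss(10_000, rng))
```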

The implementation problem

If I'm not wrong, the benefit of reducing the variance by playing with the proposal stems from the fact that we can reduce the number $M$ of particles needed to achieve a desired accuracy $\hat{\sigma}$, and thus speed up the estimation process.
Now, although this seems to be confirmed by the previous results, the gain in speed is not there. Probably I'm missing something from the theory, but I've found that my implementation based on uniform sampling is numerically stable, while my implementation based on Gaussian sampling is not. The bug is in the computation of the term
\begin{equation*}
\frac{p(x_i)}{q(x_i)}\,f(x_i)
\end{equation*}

because, with the Gaussian generator, it can happen that $x_i$ falls in a "bad" region, where $q(x_i)$ takes a small value in comparison to the product $p(x_i)\,f(x_i)$. Consequently, the term above can take a huge value. If this happens, then I obtain an extremely large $\hat{\sigma}^2$, so that the corresponding confidence interval becomes meaningless.
A simple workaround is to regenerate $x_i$ whenever the value of $q(x_i)$ causes problems. But this is not a clever solution, because it slows down the estimation process considerably. (A standard diagnostic for this weight degeneracy is sketched below.)
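A standard way to detect this weight degeneracy, rather than waiting for the confidence interval to blow up, is the effective sample size of the weights; a minimal sketch (the function name is mine):

```python
import numpy as np

def effective_sample_size(w):
    # ESS = (sum w)^2 / sum w^2: close to M means well-behaved weights,
    # close to 1 means a few huge ratios p/q dominate the estimate.
    w = np.asarray(w, dtype=float)
    return w.sum()**2 / np.sum(w**2)
```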

In the end, the estimator based on the uniform sampler (Example 1) is much faster than the estimator based on the Gaussian sampler (Example 2). It is more convenient to increase $M$ with the uniform proposal rather than running the "optimized" Gaussian with a small $M$.
It is true that the Gaussian sampler requires fewer particles to produce a good estimate, but the price of generating a single usable particle is large, so the effort spent searching for a good proposal is wasted.

Questions

  1. Assuming my background knowledge is free of misconceptions, is there something more that I should know about importance sampling?
  2. Have I made any conceptual errors in my examples?
  3. I suspect that the computation of $p(x)f(x)/q(x)$ is a common problem. It is clear that if $q(x)$ is approximately proportional to $p(x)|f(x)|$ then the problem cannot occur, but how do we treat the cases where this is not true?

Best Answer

Importance sampling fails to produce reliable estimators when the random ratio \begin{equation} \frac{p(X)}{q(X)}\,f(X)\qquad X\sim q(\cdot) \end{equation} has infinite variance, i.e., when \begin{equation} \int \frac{p^2(x)}{q(x)}\,f^2(x)\,\text{d}x = \infty \end{equation} This is the case in Example 2: the Gaussian proposal has far lighter tails than the exponential-type target, so the ratio $p^2f^2/q$ grows without bound and the integral diverges.

Since \begin{equation*}\begin{aligned} f(x)&\triangleq \xi\,\eta\\ p(x)&\triangleq [(\lambda_2+r\,\xi)(\lambda_1+r\,\eta)-r]\exp\left[-(\lambda_1\,\xi+\lambda_2\,\eta+r\,\xi\,\eta)\right]\,\mathbb{1}_{\xi,\eta \geq 0} \end{aligned} \end{equation*} the product $$ f(x)\,p(x)=\xi\,\eta\,[(\lambda_2+r\,\xi)(\lambda_1+r\,\eta)-r]\exp\left[-(\lambda_1\,\xi+\lambda_2\,\eta+r\,\xi\,\eta)\right]\,\mathbb{1}_{\xi,\eta \geq 0}$$ is upper bounded (using $e^{-r\,\xi\,\eta}\le 1$ and dropping the $-r$) by $$ \tilde q(x)=\left[\lambda_1\,\lambda_2\,\xi\,\eta+\lambda_1\,r\,\xi^2\,\eta+\lambda_2\,r\,\xi\,\eta^2+r^2\,\xi^2\,\eta^2\right]\,\exp\left[-(\lambda_1\,\xi+\lambda_2\,\eta)\right]\,\mathbb{1}_{\xi,\eta \geq 0}$$ which is [up to a multiplicative constant] equal to a mixture $q(x)$ of four bivariate Gamma distributions, hence easily normalised and simulated. Indeed, writing each monomial in terms of normalised Gamma densities (recall that $\Gamma(\xi;3,\lambda)=\lambda^3\,\xi^2\,e^{-\lambda\,\xi}/2$), \begin{align} &\left[\lambda_1\,\lambda_2\,\xi\,\eta+\lambda_1\,r\,\xi^2\,\eta+\lambda_2\,r\,\xi\,\eta^2+r^2\,\xi^2\,\eta^2\right]\,\exp\left[-(\lambda_1\,\xi+\lambda_2\,\eta)\right]\\ &\quad=\lambda_1^{-1}\,\lambda_2^{-1}\,\Gamma(\xi;2,\lambda_1)\,\Gamma(\eta;2,\lambda_2)\\ &\qquad+2\,r\,\lambda_1^{-2}\,\lambda_2^{-2}\,\Gamma(\xi;3,\lambda_1)\,\Gamma(\eta;2,\lambda_2)\\ &\qquad+2\,r\,\lambda_1^{-2}\,\lambda_2^{-2}\,\Gamma(\xi;2,\lambda_1)\,\Gamma(\eta;3,\lambda_2)\\ &\qquad+4\,r^2\,\lambda_1^{-3}\,\lambda_2^{-3}\,\Gamma(\xi;3,\lambda_1)\,\Gamma(\eta;3,\lambda_2) \end{align} i.e. \begin{align} q(x)&= C \big[\lambda_1\,\lambda_2\,\Gamma(\xi;2,\lambda_1)\,\Gamma(\eta;2,\lambda_2)\\ &\qquad+2\,r\,\Gamma(\xi;3,\lambda_1)\,\Gamma(\eta;2,\lambda_2)\\ &\qquad+2\,r\,\Gamma(\xi;2,\lambda_1)\,\Gamma(\eta;3,\lambda_2)\\ &\qquad+4\,r^2\,\lambda_1^{-1}\,\lambda_2^{-1}\,\Gamma(\xi;3,\lambda_1)\,\Gamma(\eta;3,\lambda_2)\big] \end{align} with $$C^{-1}=\lambda_1\,\lambda_2+4\,r+4\,r^2\,\lambda_1^{-1}\,\lambda_2^{-1} =\left(\sqrt{\lambda_1\,\lambda_2}+2\,r\big/\sqrt{\lambda_1\,\lambda_2} \right)^2$$

Since $f(x)p(x)/q(x)$ is bounded over $\mathbb R_+^2$, the variance of the importance sampling estimator is finite.
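A minimal sketch of this mixture-proposal estimator, assuming numpy/scipy and the normalisation worked out above; names are mine.

```python
# Importance sampling with the four-component Gamma mixture proposal.
import numpy as np
from scipy.special import gamma as gamma_fn

lam1, lam2, r = 1.0, 2.0, 0.5

def f(xi, eta):
    return xi * eta

def p(xi, eta):
    return ((lam2 + r * xi) * (lam1 + r * eta) - r) \
        * np.exp(-(lam1 * xi + lam2 * eta + r * xi * eta))

# shape pairs (a, b) and coefficients of xi^(a-1) eta^(b-1) e^(-lam1 xi - lam2 eta)
comps = [((2, 2), lam1 * lam2), ((3, 2), lam1 * r),
         ((2, 3), lam2 * r), ((3, 3), r**2)]
# mixture weight of each component: coefficient times its Gamma normalizers
wts = np.array([c * gamma_fn(a) / lam1**a * gamma_fn(b) / lam2**b
                for (a, b), c in comps])
wts /= wts.sum()

def q_mix(xi, eta):
    dens = 0.0
    for ((a, b), _), wk in zip(comps, wts):
        dens += wk * lam1**a * xi**(a - 1) * np.exp(-lam1 * xi) / gamma_fn(a) \
                   * lam2**b * eta**(b - 1) * np.exp(-lam2 * eta) / gamma_fn(b)
    return dens

rng = np.random.default_rng(0)
M = 10_000
k = rng.choice(len(comps), size=M, p=wts)        # pick mixture components
a = np.array([comps[i][0][0] for i in k])
b = np.array([comps[i][0][1] for i in k])
xi = rng.gamma(shape=a, scale=1.0 / lam1)        # Gamma(a, rate=lam1)
eta = rng.gamma(shape=b, scale=1.0 / lam2)       # Gamma(b, rate=lam2)
print(np.mean(p(xi, eta) / q_mix(xi, eta) * f(xi, eta)))
```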

Implementing the importance sampling approach is relatively straightforward in this case and returns a value of 0.225 with low variability. Interestingly, bounding by a mixture $w(\cdot)$ of four Gamma densities also applies to the target density $p(\cdot)$, $$\forall x,\quad p(x)\le(1+r/\lambda_1\lambda_2)^2\,w(x)$$ i.e., it allows for an accept-reject simulation of $p(\cdot)$; with the given parameter values the constant equals $(1+1/4)^2\approx 1.56$, which means a moderate rejection rate. However, the plain Monte Carlo estimate based on that accept-reject implementation (i) has larger variance and (ii) requires more time than the alternative importance sampling version, as shown by the boxplot below, based on 100 replications. Note that the value of the estimated integral differs from the one produced in the question by a factor of 2. (A sketch of the accept-reject sampler follows the figure.)

[Figure: boxplots of the accept-reject Monte Carlo and importance sampling estimates over 100 replications.]
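For completeness, a sketch of the accept-reject sampler for $p(\cdot)$ implied by the bound above; the decomposition of $w(\cdot)$ into Gamma components (shape pairs $(1,1),(2,1),(1,2),(2,2)$) is my own reconstruction, since the answer does not spell it out.

```python
# Accept-reject simulation of p using the Gamma-mixture envelope w;
# reuses lam1, lam2, r, p and gamma_fn from the previous snippet.
comps_w = [((1, 1), lam1 * lam2), ((2, 1), lam1 * r),
           ((1, 2), lam2 * r), ((2, 2), r**2)]
wts_w = np.array([c * gamma_fn(a) / lam1**a * gamma_fn(b) / lam2**b
                  for (a, b), c in comps_w])
bound = wts_w.sum()                 # equals (1 + r/(lam1*lam2))^2
wts_w /= bound

def w_mix(xi, eta):
    dens = 0.0
    for ((a, b), _), wk in zip(comps_w, wts_w):
        dens += wk * lam1**a * xi**(a - 1) * np.exp(-lam1 * xi) / gamma_fn(a) \
                   * lam2**b * eta**(b - 1) * np.exp(-lam2 * eta) / gamma_fn(b)
    return dens

def sample_p(n, rng):
    out = []
    while len(out) < n:
        k = rng.choice(len(comps_w), p=wts_w)
        (a, b), _ = comps_w[k]
        xi, eta = rng.gamma(a, 1.0 / lam1), rng.gamma(b, 1.0 / lam2)
        if rng.uniform() <= p(xi, eta) / (bound * w_mix(xi, eta)):
            out.append((xi, eta))   # accepted draws follow p exactly
    return np.array(out)
```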
