Importance-Sampling – On the Optimal Distribution for Importance Sampling

importance-sampling

Let's say $X$ is a rv, $p(x)$ is its pmf.
I want to importance-sample $\mu := \mathbb E[f(X)]$,
for some bounded function $0<f<1$,
using another distribution $q(x)$.
Then what I should do is summing
$$\hat\mu := \frac{1}{n} \sum_{k=1}^n \frac{f(Y_k)p(Y_k)}{q(Y_k)}$$
where $Y_k$ are generated using $q$.

According to
this lecture note,
there is an optimal $q$, denoted by $q'$,
that minimizes the variance of $\hat\mu$:
$$q'(x)\propto f(x)p(x)$$
If I sample using $q'$,
then all summands in the definition of $\hat\mu$ becomes constant.
So yeah, it does give me a variance-zero experience.
Really appreciate.

What I don't seem to understand is that,
if I now want to importance-sample $g := 1 – f$ instead,
then suddenly $q'$ is not the best distribution,
as $q'$ is no longer $\sim gp = (1 – f)p$.

Yet another example is that, if I want to sample $h := 100 + f$,
then $hp$ is just $100(p \pm \epsilon)$,
so the optimal $q$ is very close to $p$.
Or maybe it's telling me that I shouldn't bother.

How is it possible that the optimal choice of $q$
depends on an affine transformation I applied to $f$?
Does that mean that I should optimize over
the affine transformations that can be applied to $f$?

Best Answer

If you choose the importance distribution ($q$) to be proportional to $f(x)p(x)$, i.e., $\rho(x) := f(x)p(x)/\mu$, you achieve a zero variance estimator:

$$ \frac{1}{n} \sum_{k=1}^n \frac{f(x)p(x)}{f(x)p(x)/\mu} = \frac{\mu}{n} \sum_{k=1}^n \frac{f(x)p(x)}{f(x)p(x)} = \mu. $$ The idea essentially says the optimal estimator of the integral is obtained by drawing samples exactly from the integrand (normalized) and evaluating the weight, which is exactly equal to the integral itself.

RE first example: the optimal $q$ will no longer be $\rho(x) = f(x)p(x)/\mu$ because $$ \frac{1}{n}\sum_{i=1}^n \frac{(1-f(X_i))p(X_i)}{q(X_i)} = \frac{1}{n}\sum_{i=1}^n \frac{p(X_i)}{q(X_i)}q(X_i) - \frac{1}{n}\sum_{j=1}^n\frac{f(X_j)p(X_j)}{q(X_j)}, $$ you can see that if you use $\rho(x)$ as your proposal, the first term on the RHS will no longer be optimal; $\rho(x)$ is only optimal for the second term.

RE second example: In this case, $$ \frac{1}{n}\sum_{i=1}^n \frac{(100 + f(X_i))p(X_i)}{q(X_i)} = \frac{100}{n}\sum_{i=1}^n \frac{p(X_i)}{q(X_i)} + \frac{1}{n}\sum_{j=1}^n\frac{f(X_j)p(X_j)}{q(X_j)}, $$ the same idea applies; the first term on the RHS should use a $p$ as $q$.