Solved – Gradient of the expectation of a function w.r.t. distribution parameters

Tags: approximate-inference, autoencoders, machine-learning, variational-bayes

In section 2.2 of Kingma & Welling's paper on variational auto-encoders, the authors write the following equality for the gradient of the expectation of a function with respect to the parameters of the probability distribution:
$$
\nabla_\phi \mathbb{E}_{q(\mathbf{z}|\phi)}[ f(\mathbf{z}) ] ~=~
\mathbb{E}_{q(\mathbf{z}|\phi)}\,[
f(\mathbf{z})\, \nabla_{\phi} \log q(\mathbf{z}|\phi) ]
\tag{1}
$$

The authors then note that the RHS can be approximated by
$$
\mathbb{E}_{q(\mathbf{z}|\phi)}\,[
f(\mathbf{z})\, \nabla_{\phi} \log q(\mathbf{z}|\phi) ]
~\simeq~
\frac{1}{L}\sum_{i=1}^L f(\mathbf{z}^{(i)})
\, \nabla_{\phi} \log q(\mathbf{z}^{(i)}|\phi)
\tag{2}
$$

where $\mathbf{z}^{(i)} \sim q(\mathbf{z}|\phi)$.
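For intuition, here is a minimal numerical sketch of the estimator in (2). The setting is a toy choice of my own, not from the paper: a one-dimensional $q(z|\phi) = \mathcal{N}(z;\phi,1)$ and $f(z) = z$, so that $\mathbb{E}_q[f(z)] = \phi$, the exact gradient is $1$, and the score is $\nabla_\phi \log q(z|\phi) = z - \phi$.

```python
import numpy as np

# Toy sketch of estimator (2), not from the paper:
# q(z|phi) = N(z; phi, 1) and f(z) = z, so E_q[f(Z)] = phi and the
# exact gradient w.r.t. phi is 1.  The score is grad_phi log q = z - phi.
rng = np.random.default_rng(0)
phi, L = 1.5, 200_000

z = rng.normal(loc=phi, scale=1.0, size=L)   # z^(i) ~ q(z|phi)
grad_estimate = np.mean(z * (z - phi))       # the Monte Carlo sum in (2)

print(grad_estimate)   # close to the exact gradient 1.0
```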

My question is why the rewrite in equation (1) is even needed. For instance, could one approximate the LHS in equation (1) directly with something like:
$$
\nabla_{\phi} \left\{\frac{1}{L}
\sum_{i=1}^L f(\mathbf{z}^{(i)}) \, q(\mathbf{z}^{(i)}|\phi)
\right\}
~=~
\frac{1}{L}\sum_{i=1}^L f(\mathbf{z}^{(i)}) \,
\nabla_\phi \, q(\mathbf{z}^{(i)}|\phi)
\tag{3}
$$

Another question I have is whether the rewrite in equation (1) is an example of the "differentiating under the integral sign" trick that Feynman mentions in one of his memoirs.

Best Answer

Without knowing anything in particular about auto-encoders:

(1) The notation $E_{q(z|\phi)}[f(Z)]$ is not really standard, but it usually means the expression $$ E_{q(z|\phi)}[f(Z)] = \int_Z f(z)\, q(z|\phi)\, dz, $$ i.e. it is a quirky way of writing the (factorization of the) conditional expectation $E[f(Z)|\Phi=\phi]$, where the conditional density $f_{Z|\Phi}(z|\phi)$ is given by $q(z|\phi)$.

Now we want to proceed like this: $$\partial_\phi E_{q(z|\phi)}[f(Z)] = \partial_\phi \int_Z f(z)\, q(z|\phi)\, dz = \int_Z f(z)\, \partial_\phi q(z|\phi)\, dz.$$ It is not a priori clear mathematically that you may do this (and there are examples where the general statement $\partial_\phi \int f(x,\phi)\, dx = \int \partial_\phi f(x,\phi)\, dx$ is false!). The reason is that one needs to verify assumptions that allow us to interchange two limits (differentiation and integration); in probability theory these assumptions are usually met, or simply assumed in turn. More precisely, the assumption is that we can bound the difference quotients $$ \frac{f(z)\, q(z|\phi + \delta) - f(z)\, q(z|\phi)}{\delta}$$ uniformly in $z$ and $\delta$ (for $\delta$ close to $0$) by a single function that is still integrable; then the Lebesgue dominated convergence theorem lets us interchange the two limits. In your case it will (probably) work like this: use the mean value theorem to write $$\frac{f(z)\, q(z|\phi + \delta) - f(z)\, q(z|\phi)}{\delta} = f(z)\, \partial_\phi q(z|\phi + \theta)$$ for some $\theta$ in the interval $[-\delta, \delta]$, and then find an integrable function that dominates this derivative.

In short: (1): Yes, this seems to me like one of those examples where you need to pull the differentiation into the integral.
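Once the interchange is justified, the identity in (1) is just the log-derivative trick spelled out (writing $\partial_\phi$ for the gradient, as above):
$$
\partial_\phi \int_Z f(z)\, q(z|\phi)\, dz
~=~ \int_Z f(z)\, \partial_\phi q(z|\phi)\, dz
~=~ \int_Z f(z)\, q(z|\phi)\, \partial_\phi \log q(z|\phi)\, dz
~=~ E_{q(z|\phi)}\!\left[ f(Z)\, \partial_\phi \log q(Z|\phi) \right],
$$
using $\partial_\phi q(z|\phi) = q(z|\phi)\, \partial_\phi \log q(z|\phi)$, which holds wherever $q(z|\phi) > 0$.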

My question is why the rewrite in equation (1) is even needed.

Let us assume that you directly approximate the LHS. Who tells you that this approximation is any good? We have results of the form

''We can approximate the expected value $E[W]$ of a random variable $W$ by the sample mean $\frac{1}{L} \sum_{i=1}^L w_i$, with such-and-such guarantees, provided we have a sufficient number of samples $w_1, \ldots, w_L$ drawn from the distribution of $W$.''
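The kind of result being quoted, in a toy case with numbers of my own choosing: the sample mean of i.i.d. draws of $W \sim \mathcal{N}(2, 1)$ approximates $E[W] = 2$, and the approximation tightens as $L$ grows.

```python
import numpy as np

# Illustration of the quoted result with my own toy numbers:
# the sample mean of L i.i.d. draws of W ~ N(2, 1) approximates
# E[W] = 2, with typical error shrinking like 1/sqrt(L).
rng = np.random.default_rng(0)
for L in (100, 10_000, 1_000_000):
    w = rng.normal(loc=2.0, scale=1.0, size=L)
    print(L, np.mean(w))   # approaches E[W] = 2.0 as L grows
```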

So we could use these results on $\partial_\phi E_{q(z|\phi)}[f(Z)]$, but only if we know that this somewhat odd expression (the derivative, with respect to a parameter, of an expectation that depends on that parameter) is again the expected value of a random variable, and by pulling the derivative into the integral you do precisely that: you realize that it is an ordinary expectation again, namely $E[f(Z)\, \partial_\phi \log q(Z|\phi)]$.

In short: you need to rewrite (1) to (2) in order to see that the approximation is any good.

On your (3): remember that your samples $\mathbf{z}^{(i)}$ are drawn from $q(\mathbf{z}|\phi)$. By the law of large numbers, the average in (3) therefore converges to $E_{q(z|\phi)}[f(Z)\, \partial_\phi q(Z|\phi)] = \int f(z)\, q(z|\phi)\, \partial_\phi q(z|\phi)\, dz$, which contains an extra factor of $q(z|\phi)$ and so is not the gradient $\int f(z)\, \partial_\phi q(z|\phi)\, dz$ you are after. The rewrite $\partial_\phi q = q\, \partial_\phi \log q$ in (1) removes exactly this extra factor, so that what you average in (2) converges to the right quantity.
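A toy check of (2) against (3), again with a setting of my own choosing rather than anything from the paper: $q(z|\phi) = \mathcal{N}(z;\phi,1)$ and $f(z) = z^2$, so $\mathbb{E}_q[f(z)] = \phi^2 + 1$ and the exact gradient is $2\phi$.

```python
import numpy as np

# Toy comparison of estimators (2) and (3), my own example:
# q(z|phi) = N(z; phi, 1) and f(z) = z^2, so E_q[f(Z)] = phi^2 + 1
# and the exact gradient w.r.t. phi is 2*phi.
rng = np.random.default_rng(1)
phi, L = 1.5, 500_000
z = rng.normal(loc=phi, scale=1.0, size=L)            # z^(i) ~ q(z|phi)

# (2): average f(z) * grad_phi log q(z|phi) = f(z) * (z - phi).
est2 = np.mean(z**2 * (z - phi))

# (3): average f(z) * grad_phi q(z|phi) = f(z) * q(z|phi) * (z - phi);
# the density q appears once too often, so the limit is wrong.
q_pdf = np.exp(-0.5 * (z - phi) ** 2) / np.sqrt(2.0 * np.pi)
est3 = np.mean(z**2 * q_pdf * (z - phi))

print(est2)   # close to the exact gradient 2*phi = 3.0
print(est3)   # converges, but to a different (wrong) value
```

Both averages settle down as $L$ grows, but only (2) settles on the true gradient; (3) converges to the $q$-weighted integral above.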
