I recently came across the following curious identity:
$$\nabla_\theta \mathbb{E}_{x \sim D_\theta}[f(x)]
= \mathbb{E}_{x \sim D_\theta} [ \nabla_\theta \log(D_\theta(x)) f(x)],$$
where $D_\theta$ represents a probability distribution parametrized by $\theta$, $D_\theta(x)$ represents the probability that this distribution assigns to outcome $x$, $\nabla_\theta$ represents the gradient with respect to $\theta$, and $f$ represents some arbitrary function.
I can prove algebraically that this identity holds, but I lack intuition for it. Is there any intuition for why it should hold? Perhaps something that explains why it is natural for the logarithm of the probability to appear inside the expected value, or an interpretation of these quantities that makes the identity natural?
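Here's the algebraic derivation (written for a discrete set of outcomes, and assuming $D_\theta(x) > 0$ so that the logarithm is defined):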
$$\nabla_\theta \log(D_\theta(x)) f(x) = {\nabla_\theta D_\theta(x) \over D_\theta(x)} f(x),$$
so
$$\begin{align*}
\mathbb{E}_{x \sim D_\theta} [ \nabla_\theta \log(D_\theta(x)) f(x)]
&= \sum_x D_\theta(x) \nabla_\theta \log(D_\theta(x)) f(x)\\
&= \sum_x D_\theta(x) {\nabla_\theta D_\theta(x) \over D_\theta(x)} f(x)\\
&= \sum_x \nabla_\theta D_\theta(x) f(x)\\
&= \nabla_\theta \sum_x D_\theta(x) f(x)\\
&= \nabla_\theta \mathbb{E}_{x \sim D_\theta}[f(x)].
\end{align*}$$
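As a sanity check on the derivation above, here is a minimal numerical sketch. It assumes a specific family that is not part of the question: a categorical distribution over four outcomes with $D_\theta = \mathrm{softmax}(\theta)$, for which $\nabla_\theta \log D_\theta(x) = e_x - D_\theta$. The names `softmax`, `expected_f`, `grad_lhs`, and `grad_rhs`, and the values of `theta` and `f`, are all illustrative choices. The left-hand side is approximated by central finite differences and the right-hand side is computed as the exact sum over outcomes.

```python
import numpy as np

def softmax(theta):
    # Assumed parametrization: D_theta(x) = exp(theta_x) / sum_y exp(theta_y).
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_f(theta, f):
    # E_{x ~ D_theta}[f(x)] = sum_x D_theta(x) f(x)
    return softmax(theta) @ f

def grad_lhs(theta, f, eps=1e-6):
    # Left-hand side: gradient of the expectation, via central finite differences.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_f(theta + step, f) - expected_f(theta - step, f)) / (2 * eps)
    return grad

def grad_rhs(theta, f):
    # Right-hand side: E_{x ~ D_theta}[ grad_theta log D_theta(x) * f(x) ].
    p = softmax(theta)
    # Row x of `score` is grad_theta log D_theta(x) = e_x - p (softmax-specific).
    score = np.eye(theta.size) - p
    return (p * f) @ score

theta = np.array([0.3, -1.2, 0.8, 0.0])  # arbitrary parameters
f = np.array([2.0, -1.0, 0.5, 3.0])      # arbitrary function values f(x)

lhs = grad_lhs(theta, f)
rhs = grad_rhs(theta, f)
print(lhs)
print(rhs)
print(np.allclose(lhs, rhs, atol=1e-6))
```

The two gradients should agree up to finite-difference error. Nothing here depends on the particular `f`; only the `score` line is specific to the softmax family.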
Best Answer
It is the outcome of two tricks often used in analysis.
The logarithm appears only because it gives a compact way to write the quotient of the gradient of $p$ by $p$: $\nabla p / p = \nabla \log p$.