> Instead of using the score function, why do you not simply optimize for the highest reward and choose Policy* = max(all actions with discounted rewards)?
You do not have the information needed to take that maximum at the start of learning. In order to know the expected return, i.e. the discounted sum of future rewards, you need to have measured it whilst already following an optimal policy.
Iterating towards this goal, acting with a policy based on the best estimates so far, refining those estimates given the current policy (by following that policy and sampling the results), then refining the policy based on the better estimates, is essentially how action-value-based methods such as Monte Carlo Control, SARSA or Q-learning work. These are all RL solvers, but they are not always the most efficient choice for a given problem.
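For concreteness, here is a minimal sketch of one such action-value update, tabular Q-learning. The `env.step` interface and the hyperparameter values are illustrative assumptions, not something specified above:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated action value

def q_learning_step(env, state, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Act using the current best estimates (with epsilon-greedy exploration)...
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[(state, a)])
    # ...sample a result from the environment (assumed interface)...
    next_state, reward, done = env.step(action)
    # ...then refine the estimate towards the sampled one-step target.
    target = reward if done else reward + gamma * max(
        Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return next_state, done
```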
The score function helps to calculate a sampled measure of the gradient of the expected return of a parametric policy with respect to its parameters. That means you can use it to perform stochastic gradient ascent directly on the policy, improving its performance (on average) without necessarily needing to know the action values. The REINFORCE algorithm does not use action values at all. However, algorithms which do, such as Actor-Critic, can perform better, while still retaining advantages over a pure action-value approach.
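To make that concrete, here is a hedged sketch of a REINFORCE update for a linear-softmax policy; the `features` function, the episode format and the step sizes are my own illustrative assumptions:

```python
import numpy as np

def softmax_policy(theta, s, features):
    prefs = theta @ features(s)            # one preference per action
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def reinforce_update(theta, episode, features, alpha=0.01, gamma=0.99):
    # episode: list of (state, action, reward) tuples from one rollout
    G = 0.0
    for t, (s, a, r) in reversed(list(enumerate(episode))):
        G = r + gamma * G                  # return from time t onwards
        p = softmax_policy(theta, s, features)
        x = features(s)
        # Score function: grad of log pi(a|s) for a linear-softmax policy
        grad_log_pi = np.outer(np.eye(len(p))[a] - p, x)
        # Stochastic gradient ascent directly on the policy parameters
        theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```

Note that the update only needs the sampled return `G` and the score `grad_log_pi`; no action-value estimates appear anywhere.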
Which is better? It depends on the problem. Sometimes it is more efficient to express a policy as a parametric function of the state. A common example of this is when there are many actions, or the action space is continuous. Getting action-value estimates for a large number of actions, and then finding the maximising action over them, is computationally expensive. In those scenarios it is more efficient to use a policy gradient method, and the score function is needed to estimate the gradient.
Another common scenario where direct policy refinement can be better is when the ideal policy is stochastic, e.g. in the scissors/paper/stone game. Expressing this as maximising over action values is not stable: the agent will pick one action until that is exploited against it, then pick another, and so on. An agent using policy gradient with a softmax action choice, however, can learn the optimal action ratios in an environment like scissors/paper/stone; in theory, two such agents competing should converge to the Nash equilibrium of equiprobable actions, as the toy simulation below illustrates.
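Here is a toy self-play illustration (entirely my own construction, not part of the original answer) using the gradient-bandit form of policy gradient. In practice the two policies may orbit around the equilibrium rather than converge exactly, but both should hover near the equiprobable mix:

```python
import numpy as np

# Actions: 0 = stone, 1 = paper, 2 = scissors.
# payoff[i, j] = +1 if action i beats action j, -1 if it loses, 0 for a draw.
payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
rng = np.random.default_rng(0)

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h1, h2 = np.zeros(3), np.zeros(3)   # action preferences for each agent
alpha = 0.05
for _ in range(50_000):
    p1, p2 = softmax(h1), softmax(h2)
    a1, a2 = rng.choice(3, p=p1), rng.choice(3, p=p2)
    r1 = payoff[a1, a2]              # zero-sum: agent 2 receives -r1
    # Policy gradient update: move preferences along reward * grad log pi(a)
    h1 += alpha * r1 * (np.eye(3)[a1] - p1)
    h2 += alpha * (-r1) * (np.eye(3)[a2] - p2)

print(softmax(h1), softmax(h2))      # both should be near [1/3, 1/3, 1/3]
```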
Conversely, sometimes action-value methods are the more efficient choice. There might be a simpler relationship between optimal action value and state than between policy and state. A good example is a maze solver (with reward -1 per time step): the optimal action value in each state is simply related to the distance to the exit, whereas the policy has no obvious direct relation to the state, except when expressed as taking the action that minimises that distance.
The notation here is far more complicated than it needs to be, and I suspect this is contributing to your difficulty in understanding the method. To clarify the problem, I'm going to re-frame it in standard notation. I'm also going to remove the reference to $x$: the entire analysis is conditional on this value, so it adds nothing to the problem beyond complicating the notation.
You have a problem with a Gaussian random variable $Z \sim \text{N}(\mu(\phi), \sigma(\phi)^2)$, where the mean and variance depend on a parameter $\phi$. You can also define the error term $\epsilon \equiv (Z - \mu(\phi))/\sigma(\phi)$ which measures the number of standard deviations from the mean. Now, you want to compute the gradient of the expected value:
$$\begin{equation} \begin{aligned}
J(\phi) \equiv \mathbb{E}(r(Z))
&= \int \limits_\mathbb{R} r(z) \cdot \text{N}(z|\mu(\phi), \sigma(\phi)^2) \ dz \\[6pt]
&= \int \limits_\mathbb{R} r(\mu(\phi) + \epsilon \cdot \sigma(\phi)) \cdot \text{N}(\epsilon|0,1) \ d\epsilon. \\[6pt]
\end{aligned} \end{equation}$$
(The equivalence of these two integral expressions is a consequence of the change-of-variable formula for integrals.) Differentiating these expressions gives the two equivalent forms:
$$\begin{equation} \begin{aligned}
\nabla_\phi J(\phi)
&= \int \limits_\mathbb{R} r(z) \bigg( \nabla_\phi \ln \text{N}(z|\mu(\phi), \sigma(\phi)^2) \bigg) \cdot \text{N}(z|\mu(\phi), \sigma(\phi)^2) \ dz \\[6pt]
&= \int \limits_\mathbb{R} \bigg( \nabla_\phi r(\mu(\phi) + \epsilon \cdot \sigma(\phi)) \bigg) \cdot \text{N}(\epsilon|0,1) \ d\epsilon. \\[6pt]
\end{aligned} \end{equation}$$
Both of these expressions are valid expressions for the gradient of interest, and both can be approximated by corresponding finite sums over simulated values of the random variables in the expressions. To do this we can generate a finite set of values $\epsilon_1,...,\epsilon_M \sim \text{IID N}(0,1)$ and form the corresponding values $z_j = \mu(\phi) + \epsilon_j \cdot \sigma(\phi)$ for $j = 1,...,M$. Then we can use one of the following estimators:
$$\begin{equation} \begin{aligned}
\nabla_\phi J(\phi) \approx \hat{E}_1(\phi) &\equiv \frac{1}{M} \sum_{j=1}^M r(z_j) \bigg( \nabla_\phi \ln \text{N}(z_j|\mu(\phi), \sigma(\phi)^2) \bigg), \\[10pt]
\nabla_\phi J(\phi) \approx \hat{E}_2(\phi) &\equiv \frac{1}{M} \sum_{j=1}^M \nabla_\phi r(\mu(\phi) + \epsilon_j \cdot \sigma(\phi)).
\end{aligned} \end{equation}$$
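As a quick numerical check (my own sketch, not from the talk), take $\phi = (\mu, \sigma)$ to be the parameters directly and $r(z) = z^2$, so the true gradient $\partial J / \partial \mu = \partial (\mu^2 + \sigma^2) / \partial \mu = 2\mu$ and both estimators have analytic forms:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, M = 1.5, 2.0, 100_000

eps = rng.standard_normal(M)
z = mu + eps * sigma                      # reparameterised samples

r = lambda v: v ** 2                      # example reward function
r_grad = lambda v: 2 * v                  # dr/dz, known analytically here

# E_1: score-function estimator, using d/dmu ln N(z|mu, sigma^2) = (z - mu)/sigma^2
e1 = r(z) * (z - mu) / sigma ** 2
# E_2: reparameterisation estimator, d/dmu r(mu + eps*sigma) = r'(mu + eps*sigma)
e2 = r_grad(mu + eps * sigma)

print("true gradient d/dmu E[r(Z)] =", 2 * mu)                       # 3.0
print("score function : mean %.3f  variance %.1f" % (e1.mean(), e1.var()))
print("reparameterised: mean %.3f  variance %.1f" % (e2.mean(), e2.var()))
```

For this particular $r$ both estimators are unbiased, and the reparameterised one has far lower variance, consistent with the speaker's claim; as noted below, though, this need not hold for every $r$.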
The speaker asserts (but does not demonstrate) that the variance of the second estimator is lower than that of the first. He claims this is because the second estimator uses direct information about the gradient of $r$, whereas the first uses only information about the gradient of the log-density of the normal distribution. Personally, without more knowledge of the nature of $r$, this seems to me an unsatisfying explanation, and I can see why you are confused by it. I doubt that the result holds for all functions $r$, but perhaps within the context of that field, the function $r$ tends to have a gradient that is fairly insensitive to changes in its argument.
Best Answer
1) It is OK to take the derivative of an expectation because an expectation is just an integral, and we can differentiate an integral. As the author mentioned, we can use Leibniz's rule to justify moving the differentiation operator inside the integral.
2) We can't estimate the gradient with \begin{equation} \frac{1}{S}\sum_{s=1}^S f(z^{(s)})\nabla_{\theta}p(z^{(s)};\theta), \quad z^{(s)}\sim p(z;\theta), \end{equation} because this summation is a Monte Carlo estimator for an expectation, and the integral we want is not an expectation of this form. An expectation, written as an integral, looks like \begin{equation} \mathbb{E}_{p(z)}[f(z)] = \int p(z) f(z) \, dz. \end{equation} The integral we're trying to estimate is \begin{equation} \int \nabla_{\theta}p(z;\theta)f(z) \, dz, \end{equation} which is not of the form $\int p(z) f(z) \, dz$ where $p(z)$ is a probability density function: just because $p(z;\theta)$ is a probability density function does not mean $\nabla_\theta p(z;\theta)$ is one.
Your given summation is actually estimating \begin{equation} \mathbb{E}_{p(z;\theta)} [f(z) \nabla_\theta p(z; \theta)] \neq \nabla_{\theta}\mathbb{E}_{p(z;\theta)}[f(z)]. \end{equation} The score-function (log-derivative) trick fixes this by writing $\nabla_\theta p(z;\theta) = p(z;\theta) \nabla_\theta \ln p(z;\theta)$, which turns the integral back into a proper expectation under $p(z;\theta)$.
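A quick numpy check (my own construction) makes the mismatch concrete. Take $p(z;\theta) = \text{N}(z \,|\, \theta, 1)$ and $f(z) = z^2$, so the true gradient is $\nabla_\theta \mathbb{E}[f(z)] = \nabla_\theta (\theta^2 + 1) = 2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, S = 1.0, 500_000
z = rng.normal(theta, 1.0, S)              # z^(s) ~ p(z; theta) = N(theta, 1)

f = z ** 2                                 # true gradient: d/dtheta (theta^2 + 1) = 2
grad_log_p = z - theta                     # d/dtheta log N(z | theta, 1)
p = np.exp(-0.5 * (z - theta) ** 2) / np.sqrt(2 * np.pi)
grad_p = p * grad_log_p                    # d/dtheta N(z | theta, 1)

print("true gradient         :", 2 * theta)
print("score-function average:", (f * grad_log_p).mean())  # consistent, ~2.0
print("naive f * grad_p      :", (f * grad_p).mean())      # ~0.28, not the gradient
```

The naive average converges to $\mathbb{E}_{p(z;\theta)}[f(z)\nabla_\theta p(z;\theta)]$, which carries an extra factor of the density and is simply a different number.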