> Instead of using the score function, why do you not simply optimize for the highest reward and choose Policy* = max(all actions with discounted rewards)?
You do not have the information needed to take that maximum at the start of learning. In order to know the expected return, i.e. the discounted sum of future rewards, you need to have measured it whilst already following an optimal policy.
Iterating towards this goal, acting with a policy based on the best estimates so far, refining those estimates given the current policy (by following that policy and sampling the results), then refining the policy based on the better estimates, is essentially how action-value-based methods such as Monte Carlo Control, SARSA or Q-learning work. These are all RL solvers, but they are not always the most efficient choice for a given problem.
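For concreteness, here is a minimal sketch of one such action-value update, tabular Q-learning. The `env.step` interface and the hyperparameter values are illustrative assumptions, not something specified above:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated action value

def q_learning_step(env, state, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Act using the current best estimates (with epsilon-greedy exploration)...
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[(state, a)])
    # ...sample a result from the environment (assumed interface)...
    next_state, reward, done = env.step(action)
    # ...then refine the estimate towards the sampled one-step target.
    target = reward if done else reward + gamma * max(
        Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return next_state, done
```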
The score function helps to calculate a sampled measure of the gradient of the expected return of a parametric policy with respect to its parameters. That means you can use it to perform stochastic gradient ascent directly on the policy, improving its performance (on average) without necessarily needing to know the action values. The REINFORCE algorithm does not use action values at all. However, algorithms which do, such as Actor-Critic, can perform better, while still retaining advantages over a pure action-value approach.
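To make that concrete, here is a hedged sketch of a REINFORCE update for a linear-softmax policy; the `features` function, the episode format and the step sizes are my own illustrative assumptions:

```python
import numpy as np

def softmax_policy(theta, s, features):
    prefs = theta @ features(s)            # one preference per action
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def reinforce_update(theta, episode, features, alpha=0.01, gamma=0.99):
    # episode: list of (state, action, reward) tuples from one rollout
    G = 0.0
    for t, (s, a, r) in reversed(list(enumerate(episode))):
        G = r + gamma * G                  # return from time t onwards
        p = softmax_policy(theta, s, features)
        x = features(s)
        # Score function: grad of log pi(a|s) for a linear-softmax policy
        grad_log_pi = np.outer(np.eye(len(p))[a] - p, x)
        # Stochastic gradient ascent directly on the policy parameters
        theta += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```

Note that the update only needs the sampled return `G` and the score `grad_log_pi`; no action-value estimates appear anywhere.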
Which is better? It depends on the problem. Sometimes it is more efficient to express a policy as a parametric function of the state. A common example of this is when there are many actions, or the action space is continuous. Getting action-value estimates for a large number of actions, and then finding the maximising action over them, is computationally expensive. In those scenarios it is more efficient to use a policy gradient method, and the score function is needed to estimate the gradient.
Another common scenario where direct policy refinement can be better is when the ideal policy is stochastic, e.g. in the scissors/paper/stone game. Expressing this as maximising over action values is not stable: the agent will pick one action until that is exploited against it, then pick another, and so on. An agent using policy gradient with a softmax action choice, however, can learn the optimal action ratios in an environment like scissors/paper/stone; in theory, two such agents competing should converge to the Nash equilibrium of equiprobable actions, as the toy simulation below illustrates.
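Here is a toy self-play illustration (entirely my own construction, not part of the original answer) using the gradient-bandit form of policy gradient. In practice the two policies may orbit around the equilibrium rather than converge exactly, but both should hover near the equiprobable mix:

```python
import numpy as np

# Actions: 0 = stone, 1 = paper, 2 = scissors.
# payoff[i, j] = +1 if action i beats action j, -1 if it loses, 0 for a draw.
payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
rng = np.random.default_rng(0)

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h1, h2 = np.zeros(3), np.zeros(3)   # action preferences for each agent
alpha = 0.05
for _ in range(50_000):
    p1, p2 = softmax(h1), softmax(h2)
    a1, a2 = rng.choice(3, p=p1), rng.choice(3, p=p2)
    r1 = payoff[a1, a2]              # zero-sum: agent 2 receives -r1
    # Policy gradient update: move preferences along reward * grad log pi(a)
    h1 += alpha * r1 * (np.eye(3)[a1] - p1)
    h2 += alpha * (-r1) * (np.eye(3)[a2] - p2)

print(softmax(h1), softmax(h2))      # both should be near [1/3, 1/3, 1/3]
```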
Conversely, sometimes action-value methods are the more efficient choice. There might be a simpler relationship between optimal action value and state than between policy and state. A good example is a maze solver (with reward -1 per time step): the optimal action value in each state is simply related to the distance to the exit, whereas the policy has no obvious direct relation to the state, except when expressed as taking the action that minimises that distance.
The notation here is far more complicated than it needs to be, and I suspect this is contributing to your difficulty in understanding the method. To clarify the problem, I'm going to re-frame it in standard notation. I'm also going to remove the reference to $x$: the entire analysis is conditional on this value, so it adds nothing to the problem beyond complicating the notation.
You have a problem with a Gaussian random variable $Z \sim \text{N}(\mu(\phi), \sigma(\phi)^2)$, where the mean and variance depend on a parameter $\phi$. You can also define the error term $\epsilon \equiv (Z - \mu(\phi))/\sigma(\phi)$ which measures the number of standard deviations from the mean. Now, you want to compute the gradient of the expected value:
$$\begin{equation} \begin{aligned}
J(\phi) \equiv \mathbb{E}(r(Z))
&= \int \limits_\mathbb{R} r(z) \cdot \text{N}(z|\mu(\phi), \sigma(\phi)^2) \ dz \\[6pt]
&= \int \limits_\mathbb{R} r(\mu(\phi) + \epsilon \cdot \sigma(\phi)) \cdot \text{N}(\epsilon|0,1) \ d\epsilon. \\[6pt]
\end{aligned} \end{equation}$$
(The equivalence of these two integral expressions is a consequence of the change-of-variable formula for integrals.) Differentiating these expressions gives the two equivalent forms:
$$\begin{equation} \begin{aligned}
\nabla_\phi J(\phi)
&= \int \limits_\mathbb{R} r(z) \bigg( \nabla_\phi \ln \text{N}(z|\mu(\phi), \sigma(\phi)^2) \bigg) \cdot \text{N}(z|\mu(\phi), \sigma(\phi)^2) \ dz \\[6pt]
&= \int \limits_\mathbb{R} \bigg( \nabla_\phi r(\mu(\phi) + \epsilon \cdot \sigma(\phi)) \bigg) \cdot \text{N}(\epsilon|0,1) \ d\epsilon. \\[6pt]
\end{aligned} \end{equation}$$
Both of these expressions are valid expressions for the gradient of interest, and both can be approximated by corresponding finite sums over simulated values of the random variables in the expressions. To do this we can generate a finite set of values $\epsilon_1,...,\epsilon_M \sim \text{IID N}(0,1)$ and form the corresponding values $z_j = \mu(\phi) + \epsilon_j \cdot \sigma(\phi)$ for $j = 1,...,M$. Then we can use one of the following estimators:
$$\begin{equation} \begin{aligned}
\nabla_\phi J(\phi) \approx \hat{E}_1(\phi) &\equiv \frac{1}{M} \sum_{j=1}^M r(z_j) \bigg( \nabla_\phi \ln \text{N}(z_j|\mu(\phi), \sigma(\phi)^2) \bigg), \\[10pt]
\nabla_\phi J(\phi) \approx \hat{E}_2(\phi) &\equiv \frac{1}{M} \sum_{j=1}^M \nabla_\phi r(\mu(\phi) + \epsilon_j \cdot \sigma(\phi)).
\end{aligned} \end{equation}$$
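As a quick numerical check (my own sketch, not from the talk), take $\phi = (\mu, \sigma)$ to be the parameters directly and $r(z) = z^2$, so the true gradient $\partial J / \partial \mu = \partial (\mu^2 + \sigma^2) / \partial \mu = 2\mu$ and both estimators have analytic forms:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, M = 1.5, 2.0, 100_000

eps = rng.standard_normal(M)
z = mu + eps * sigma                      # reparameterised samples

r = lambda v: v ** 2                      # example reward function
r_grad = lambda v: 2 * v                  # dr/dz, known analytically here

# E_1: score-function estimator, using d/dmu ln N(z|mu, sigma^2) = (z - mu)/sigma^2
e1 = r(z) * (z - mu) / sigma ** 2
# E_2: reparameterisation estimator, d/dmu r(mu + eps*sigma) = r'(mu + eps*sigma)
e2 = r_grad(mu + eps * sigma)

print("true gradient d/dmu E[r(Z)] =", 2 * mu)                       # 3.0
print("score function : mean %.3f  variance %.1f" % (e1.mean(), e1.var()))
print("reparameterised: mean %.3f  variance %.1f" % (e2.mean(), e2.var()))
```

For this particular $r$ both estimators are unbiased, and the reparameterised one has far lower variance, consistent with the speaker's claim; as noted below, though, this need not hold for every $r$.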
The speaker asserts (but does not demonstrate) that the variance of the second estimator is lower than that of the first. He claims this is because the second estimator uses direct information about the gradient of $r$, whereas the first uses only information about the gradient of the log-density of the normal distribution. Personally, without more knowledge of the nature of $r$, this seems to me an unsatisfying explanation, and I can see why you are confused by it. I doubt that the result holds for all functions $r$, but perhaps within the context of that field, the function $r$ tends to have a gradient that is fairly insensitive to changes in its argument.
Best Answer
1) It is OK to take the derivative of an expectation because an expectation is just an integral, and we can differentiate an integral. As the author mentioned, we can use Leibniz's rule to justify moving the differentiation operator inside the integral.
2) We can't estimate the gradient with \begin{equation} \frac{1}{S}\sum_{s=1}^S f(z^{(s)})\nabla_{\theta}p(z^{(s)};\theta), \quad z^{(s)}\sim p(z;\theta), \end{equation} because this summation is a Monte Carlo estimator for an expectation, and the integral we want is not an expectation of this form. An expectation, written as an integral, looks like \begin{equation} \mathbb{E}_{p(z)}[f(z)] = \int p(z) f(z) \, dz. \end{equation} The integral we're trying to estimate is \begin{equation} \int \nabla_{\theta}p(z;\theta)f(z) \, dz, \end{equation} which is not of the form $\int p(z) f(z) \, dz$ where $p(z)$ is a probability density function: just because $p(z;\theta)$ is a probability density function does not mean $\nabla_\theta p(z;\theta)$ is one.
Your given summation is actually estimating \begin{equation} \mathbb{E}_{p(z;\theta)} [f(z) \nabla_\theta p(z; \theta)] \neq \nabla_{\theta}\mathbb{E}_{p(z;\theta)}[f(z)]. \end{equation} The score-function (log-derivative) trick fixes this by writing $\nabla_\theta p(z;\theta) = p(z;\theta) \nabla_\theta \ln p(z;\theta)$, which turns the integral back into a proper expectation under $p(z;\theta)$.
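A quick numpy check (my own construction) makes the mismatch concrete. Take $p(z;\theta) = \text{N}(z \,|\, \theta, 1)$ and $f(z) = z^2$, so the true gradient is $\nabla_\theta \mathbb{E}[f(z)] = \nabla_\theta (\theta^2 + 1) = 2\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, S = 1.0, 500_000
z = rng.normal(theta, 1.0, S)              # z^(s) ~ p(z; theta) = N(theta, 1)

f = z ** 2                                 # true gradient: d/dtheta (theta^2 + 1) = 2
grad_log_p = z - theta                     # d/dtheta log N(z | theta, 1)
p = np.exp(-0.5 * (z - theta) ** 2) / np.sqrt(2 * np.pi)
grad_p = p * grad_log_p                    # d/dtheta N(z | theta, 1)

print("true gradient         :", 2 * theta)
print("score-function average:", (f * grad_log_p).mean())  # consistent, ~2.0
print("naive f * grad_p      :", (f * grad_p).mean())      # ~0.28, not the gradient
```

The naive average converges to $\mathbb{E}_{p(z;\theta)}[f(z)\nabla_\theta p(z;\theta)]$, which carries an extra factor of the density and is simply a different number.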