Derivative of the log of the softmax function

derivatives, gradient descent, machine learning

Could someone explain how that derivative was arrived at?

As far as I can tell, the derivative of $\log(\text{softmax})$ is
$$
\nabla\log(\text{softmax}) =
\begin{cases}
1-\text{softmax}, & \text{if $i=j$} \\
-\text{softmax}, & \text{if $i \neq j$}
\end{cases}
$$
Where did that expectation come from?
Here $\phi(s,a)$ is a vector and $\theta$ is also a vector. $\pi(s,a)$ denotes the probability of taking action $a$ in state $s$.

Best Answer

The derivation of the softmax score function (aka eligibility vector) is as follows:

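First, note that the policy is the softmax of the linear scores $\phi(s,a_k)^\intercal\theta$ over the $N$ available actions $a_1, \dots, a_N$: $$\pi_\theta(s,a) = \frac{e^{\phi(s,a)^\intercal\theta}}{\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta}}$$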
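As a concrete illustration of this definition, here is a minimal sketch in pure Python. The function name `softmax_policy` and the example feature values are illustrative assumptions; they do not come from the slide or the question.

```python
import math

def softmax_policy(features, theta):
    """Return pi_theta(s, a_k) for every action a_k in a fixed state s.

    features: list of feature vectors phi(s, a_k), one per action
    theta:    parameter vector with the same length as each phi(s, a_k)
    """
    # Linear score phi(s, a_k)^T theta for each action
    scores = [sum(p * t for p, t in zip(phi, theta)) for phi in features]
    # Subtracting the max score avoids overflow and leaves the ratios unchanged
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)  # normalization factor (the denominator)
    return [e / total for e in exps]

# Hypothetical example: 3 actions, 2 features per action
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.25]
probs = softmax_policy(features, theta)
print(probs)
```

The probabilities sum to 1 only because of the division by `total`; without it, the exponentials are merely proportional to the policy.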

The important point is that the slide gives the policy only up to proportionality. The full softmax function also requires the normalization factor, which is the denominator in the formula for $\pi_\theta(s,a)$.

Continuing the derivation:

Using the $\log$ identity $\log(x/y) = \log(x) - \log(y)$ we can write $$\log(\pi_\theta(s,a)) = \log(e^{\phi(s,a)^\intercal\theta}) - \log(\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta}) $$

Now take the gradient:

$$\nabla_\theta\log(\pi_\theta(s,a)) = \nabla_\theta\log(e^{\phi(s,a)^\intercal\theta}) - \nabla_\theta\log(\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta})$$

The left term simplifies as follows:

$$\text{left} = \nabla_\theta\log(e^{\phi(s,a)^\intercal\theta}) = \nabla_\theta\left(\phi(s,a)^\intercal\theta\right) = \phi(s,a)$$

The right term simplifies as follows:

Using the chain rule: $$\nabla_x\log(f(x)) = \frac{\nabla_xf(x)}{f(x)}$$

We can write:

$$\text{right} = \nabla_\theta\log\left(\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta}\right) = \frac{\nabla_\theta\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta}}{\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta}}$$

Taking the gradient of the numerator we get:

$$\text{right} = \frac{\sum_{k=1}^N{\phi(s,a_k)}e^{\phi(s,a_k)^\intercal\theta}}{\sum_{k=1}^Ne^{\phi(s,a_k)^\intercal\theta}}$$

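Each term of the sum contains $e^{\phi(s,a_k)^\intercal\theta}$ divided by the normalization factor, which is exactly $\pi_\theta(s,a_k)$ by definition, so we can simplify to:

$$\text{right} = \sum_{k=1}^N{\phi(s,a_k)}\pi_\theta(s,a_k)$$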

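Recall the definition of the expected value of a discrete random variable:

$$\mathrm{E}[X] = \sum_{i=1}^n x_ip_i = x_1p_1+x_2p_2+ \dots +x_np_n$$

In words, it is the sum of each value times its probability. Here the values are the feature vectors and the probabilities are given by the policy:

$$X = \text{features} = {\phi(s,a_k)}$$

$$P = \text{probabilities} = \pi_\theta(s,a_k)$$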

So the right term is the expected value of the features under the policy:

$$\text{right} = \mathrm{E}_{\pi_\theta}[\phi(s,\cdot)]$$

where $\cdot$ indicates that the expectation is taken over all possible actions.
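The expectation can be made concrete with a short sketch in pure Python. The helper names `softmax_policy` and `expected_features` and the example numbers are assumptions chosen for illustration only.

```python
import math

def softmax_policy(features, theta):
    """pi_theta(s, a_k) for every action, given feature vectors phi(s, a_k)."""
    scores = [sum(p * t for p, t in zip(phi, theta)) for phi in features]
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_features(features, theta):
    """E_pi[phi(s, .)] = sum_k phi(s, a_k) * pi_theta(s, a_k), per component."""
    probs = softmax_policy(features, theta)
    dim = len(theta)
    return [sum(p * phi[i] for p, phi in zip(probs, features))
            for i in range(dim)]

# Hypothetical example: with theta = 0 every action is equally likely,
# so the expectation is the plain average of the feature vectors.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.0, 0.0]
expected = expected_features(features, theta)
print(expected)
```

Note that the result is a vector with the same dimension as $\theta$, not a scalar: the probabilities weight whole feature vectors.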

Putting it all together: $$\nabla_\theta\log(\pi_\theta(s,a)) = \text{left} - \text{right} = \phi(s,a) - \mathrm{E}_{\pi_\theta}[\phi(s,\cdot)]$$
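One way to gain confidence in this result is to compare it with a finite-difference gradient of $\log\pi_\theta(s,a)$. The sketch below does this in pure Python; the function names, the feature values, and the step size `eps` are illustrative assumptions, not part of the original derivation.

```python
import math

def softmax_policy(features, theta):
    """pi_theta(s, a_k) for every action, given feature vectors phi(s, a_k)."""
    scores = [sum(p * t for p, t in zip(phi, theta)) for phi in features]
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score(features, theta, a):
    """Analytic score: phi(s, a) - E_pi[phi(s, .)], for the action with index a."""
    probs = softmax_policy(features, theta)
    dim = len(theta)
    expected = [sum(p * phi[i] for p, phi in zip(probs, features))
                for i in range(dim)]
    return [features[a][i] - expected[i] for i in range(dim)]

def numerical_score(features, theta, a, eps=1e-6):
    """Central-difference estimate of the gradient of log pi_theta(s, a)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        lp_up = math.log(softmax_policy(features, up)[a])
        lp_down = math.log(softmax_policy(features, down)[a])
        grad.append((lp_up - lp_down) / (2 * eps))
    return grad

# Hypothetical example: 3 actions, 2 features per action
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.25]
for a in range(len(features)):
    print(a, score(features, theta, a), numerical_score(features, theta, a))
```

For each action the two vectors should agree up to the finite-difference error. The scores also average to zero under the policy, since $\sum_k \pi_\theta(s,a_k)\left(\phi(s,a_k) - \mathrm{E}_{\pi_\theta}[\phi(s,\cdot)]\right) = 0$.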
