[Math] Cross-Entropy loss in Reinforcement Learning

machine learning

In the context of supervised learning for classification with neural networks, one way to evaluate the performance of a model is the cross-entropy loss, given by:
$$
L = -\sum_{i=1}^{n} \log\left(\pi(f(x_i))_{y_i}\right)
$$
Where $x_i$ is a vector datapoint, $\pi$ is the softmax function, $f$ is our neural network, and $y_i$ is the correct class for that datapoint. So the result is that we get the $\log$ of the normalised network output for the correct class. The smaller $L$ is, the more accurate our model is. So we can view this as a minimisation problem, where we just need to minimise $L$ with respect to the network weights. However, if we now look at how this equation is used in reinforcement learning, where the network output represents an action chosen to interact with an environment instead of a class, we get something like this:
$$
L = -\sum_{i=1}^{n} r_i \log\left(\pi(f(x_i))_{a_i}\right)
$$
Where $r_i$ is a scalar representing the reward at a time step, and $a_i$ is the corresponding chosen action. Since we are now in a reinforcement learning setting, $a_i$ is used because we don't know what the correct action is, but by weighting each term by the reward, the gradient is pushed in the right direction when we backpropagate through the network.
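To make the two formulas concrete, here is a minimal NumPy sketch of both losses. The logits, labels, actions, and rewards are made-up placeholders, and `softmax` is just a small helper for illustration:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical network outputs f(x_i) for a batch of 3 datapoints, 4 classes/actions.
logits = np.array([[ 2.0, 0.5, -1.0,  0.1],
                   [ 0.2, 1.5,  0.3, -0.7],
                   [-0.4, 0.0,  2.2,  1.1]])
probs = softmax(logits)  # pi(f(x_i))

# Supervised case: y_i is the correct class index for each datapoint.
y = np.array([0, 1, 2])
ce_loss = -np.sum(np.log(probs[np.arange(len(y)), y]))

# RL case: a_i is the sampled action, r_i the reward received at that step.
a = np.array([0, 3, 2])
r = np.array([1.0, -0.5, 2.0])
pg_loss = -np.sum(r * np.log(probs[np.arange(len(a)), a]))

print(ce_loss, pg_loss)
```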

Although this is how the difference between the equations has been explained to me, there is one thing that doesn't quite add up. In the reinforcement learning version of the equation, $L$ now increases when the agent performs well and decreases when the agent performs badly (where a negative reward means punishment). This makes me think it has become a maximisation problem; however, all the implementations I've seen still treat the equation as a minimisation problem. Why is that? Have I misunderstood the maths?

Best Answer

In the reinforcement learning setting, you are trying to maximize the expected reward under the policy. When you take the gradient in a stochastic, sampled setting, this corresponds to maximizing the log probability of the action you actually took, weighted by the return/reward you received (see the REINFORCE algorithm, policy gradients, and score-function estimators in general).
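Concretely, writing $\pi_\theta(a)$ for the probability the policy assigns to action $a$ and $R(a)$ for the return received (notation introduced here for illustration), the score-function identity that REINFORCE relies on is:
$$
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\right]
= \sum_a R(a)\, \nabla_\theta \pi_\theta(a)
= \sum_a R(a)\, \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)
= \mathbb{E}_{a \sim \pi_\theta}\!\left[R(a)\, \nabla_\theta \log \pi_\theta(a)\right]
$$
So for a sampled action $a_i$ with reward $r_i$, the quantity $r_i \nabla_\theta \log \pi(f(x_i))_{a_i}$ is an unbiased estimate of the gradient of the expected reward, which is exactly the term (up to the minus sign) appearing in your loss.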

Note that the cross-entropy loss has a negative sign in front. Take the negative sign away and maximize instead of minimizing: you are then maximizing the log probability of the action times the reward, which is exactly what you want.

So, minimizing the cross-entropy loss is equivalent to maximizing the probability of the target under the learned distribution.
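As a concrete illustration of "keep the minus sign and minimize", here is a minimal PyTorch-style sketch of a REINFORCE-like update; the tiny policy network, states, actions, and returns are all made-up placeholders. The optimizer minimizes $-\sum_i r_i \log \pi(f(x_i))_{a_i}$, which is the same as maximizing $\sum_i r_i \log \pi(f(x_i))_{a_i}$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny policy network: 4-dimensional state -> 3 actions.
policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Placeholder batch of states, sampled actions, and returns/rewards.
states  = torch.randn(5, 4)
actions = torch.randint(0, 3, (5,))
returns = torch.tensor([1.0, 0.5, -1.0, 2.0, 0.0])

logits    = policy(states)
log_probs = F.log_softmax(logits, dim=-1)                   # log pi(f(x_i))
chosen    = log_probs[torch.arange(len(actions)), actions]  # log pi(f(x_i))_{a_i}

# Minimizing this loss is identical to maximizing sum(r_i * log pi(a_i)).
loss = -(returns * chosen).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Flipping the sign and calling a maximizer would give exactly the same update; frameworks simply default to minimization, which is why implementations keep the negative sign in the loss.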
